Streamlined Data Ingestion with pandas
Amany Mahfouz
Instructor
pandas automatically infers column data types

print(tax_data.dtypes)
STATEFIPS int64
STATE object
zipcode int64
agi_stub int64
N1 int64
...
N11902 int64
A11902 int64
Length: 147, dtype: object
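As a minimal sketch of how this inference works (using made-up in-memory data, not the tax file), numeric columns come back as int64 or float64 while text columns stay object:

# Sketch: dtype inference on hypothetical data, not from the course files
import io
import pandas as pd

sample = io.StringIO("id,name,amount\n1,AL,10.5\n2,AK,20.0\n")
df = pd.read_csv(sample)

# id is inferred as int64, amount as float64, name as object
print(df.dtypes)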
dtype keyword argument to specify column data types
dtype takes a dictionary of column names and data types

tax_data = pd.read_csv("us_tax_data_2016.csv", dtype={"zipcode": str})
print(tax_data.dtypes)
STATEFIPS int64
STATE object
zipcode object
agi_stub int64
N1 int64
...
N11902 int64
A11902 int64
Length: 147, dtype: object
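The dtype dictionary can cover several columns in a single read_csv call. A sketch of what that might look like (the column and type choices here are illustrative, not from the slides):

# Sketch: cast multiple columns at once; "category" is a valid dtype string
import pandas as pd

tax_data = pd.read_csv("us_tax_data_2016.csv",
                       dtype={"zipcode": str,
                              "agi_stub": "category"})
print(tax_data[["zipcode", "agi_stub"]].dtypes)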
pandas automatically interprets some values as missing or NA

print(tax_data.head())
STATEFIPS STATE zipcode agi_stub N1 ... A85300 N11901 A11901 N11902 A11902
0 1 AL 0 1 815440 ... 0 63420 51444 711580 1831661
1 1 AL 0 2 495830 ... 0 74090 110889 416090 1173463
2 1 AL 0 3 263390 ... 0 64000 143060 195130 543284
3 1 AL 0 4 167190 ... 0 45020 128920 117410 381329
4 1 AL 0 5 217440 ... 19 82940 423629 126130 506526
[5 rows x 147 columns]
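By default, pandas treats empty fields and strings such as "NA" or "NaN" as missing; the keep_default_na argument turns this off. A minimal sketch with made-up data:

# Sketch: default missing-value handling on hypothetical data
import io
import pandas as pd

csv = io.StringIO("zipcode,N1\n35004,NA\n35005,\n35006,100\n")

print(pd.read_csv(csv).N1.isna().sum())            # 2: "NA" and the empty field become NaN
csv.seek(0)
print(pd.read_csv(csv, keep_default_na=False).N1)  # values kept as plain strings instead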
na_values keyword argument to set custom missing values

tax_data = pd.read_csv("us_tax_data_2016.csv", na_values={"zipcode": 0})
print(tax_data[tax_data.zipcode.isna()])
STATEFIPS STATE zipcode agi_stub N1 ... A85300 N11901 A11901 N11902 A11902
0 1 AL NaN 1 815440 ... 0 63420 51444 711580 1831661
1 1 AL NaN 2 495830 ... 0 74090 110889 416090 1173463
2 1 AL NaN 3 263390 ... 0 64000 143060 195130 543284
... ... ... ... ... ... ... ... ... ... ... ...
179034 56 WY NaN 5 38030 ... 121 13230 73326 22250 99589
179035 56 WY NaN 6 8480 ... 53835 3630 128149 2250 125557
[306 rows x 147 columns]
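na_values also accepts a single value, a list applied to every column, or a dictionary mapping column names to lists of sentinel values. A sketch of the dictionary form (the extra column and sentinel here are chosen for illustration):

# Sketch: per-column lists of values to treat as missing
import pandas as pd

tax_data = pd.read_csv("us_tax_data_2016.csv",
                       na_values={"zipcode": [0], "N1": [0]})
print(tax_data[["zipcode", "N1"]].isna().sum())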
Sample of us_tax_data_2016_corrupt.csv
STATEFIPS,STATE,zipcode,agi_stub,...,N11901,A11901,N11902,A11902
1,AL,0,1,...,63420,51444,711580,1831661
1,AL,0, ,2,...,74090,110889,416090,1173463
tax_data = pd.read_csv("us_tax_data_2016_corrupt.csv")
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
data = pd.read_csv('us_tax_data_2016_corrupt.csv')
File "<stdin>", line 697, in parser_f
return _read(filepath_or_buffer, kwds)
File "<stdin>", line 430, in _read
data = parser.read(nrows)
File "<stdin>", line 1134, in read
ret = self._engine.read(nrows)
File "<stdin>", line 1990, in read
data = self._reader.read(nrows)
File "<stdin>", line 899, in pandas._libs.parsers.TextReader.read
File "<stdin>", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "<stdin>", line 968, in pandas._libs.parsers.TextReader._read_rows
File "<stdin>", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
File "<stdin>", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 147 fields in line 3, saw 148
error_bad_lines=False to skip unparseable records
warn_bad_lines=True to see messages when records are skipped

tax_data = pd.read_csv("us_tax_data_2016_corrupt.csv",
                       error_bad_lines=False,
                       warn_bad_lines=True)
b'Skipping line 3: expected 147 fields, saw 148\n'
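Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and later removed; recent versions express the same behavior with the on_bad_lines argument. A sketch of the equivalent call:

# In pandas 1.3+, on_bad_lines="warn" skips bad lines and emits a warning for each
import pandas as pd

tax_data = pd.read_csv("us_tax_data_2016_corrupt.csv",
                       on_bad_lines="warn")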