Streamlined Data Ingestion with pandas
Amany Mahfouz
Instructor
pandas automatically infers column data types.

print(tax_data.dtypes)
STATEFIPS      int64
STATE         object
zipcode        int64
agi_stub       int64
N1             int64
               ...
N11902         int64
A11902         int64
Length: 147, dtype: object
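To see this inference in isolation, here is a minimal, self-contained sketch using a small made-up CSV (not the tax data): integer, decimal, and text columns come back as int64, float64, and object.

import io
import pandas as pd

# Hypothetical CSV text; pandas infers a type for each column
csv_text = """zipcode,agi_stub,ratio,state
35004,1,0.25,AL
35005,2,0.50,AL
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)
# zipcode       int64
# agi_stub      int64
# ratio       float64
# state        object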
Use the dtype keyword argument to specify column data types.
dtype takes a dictionary of column names and data types.

tax_data = pd.read_csv("us_tax_data_2016.csv", dtype={"zipcode": str})
print(tax_data.dtypes)
STATEFIPS      int64
STATE         object
zipcode       object
agi_stub       int64
N1             int64
               ...
N11902         int64
A11902         int64
Length: 147, dtype: object
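Why reading ZIP codes as strings matters: a minimal sketch with a hypothetical CSV whose ZIP codes have leading zeros. Read as integers, the zeros are lost; with dtype={"zipcode": str} they are preserved.

import io
import pandas as pd

csv_text = """zipcode,agi_stub
01001,1
01002,2
"""

# Default inference: zipcode becomes int64 and the leading zeros disappear
print(pd.read_csv(io.StringIO(csv_text))["zipcode"].tolist())  # [1001, 1002]

# dtype maps column names to types; zipcode stays a string
df = pd.read_csv(io.StringIO(csv_text), dtype={"zipcode": str})
print(df["zipcode"].tolist())  # ['01001', '01002']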
pandas automatically interprets some values as missing or NA.

print(tax_data.head())
   STATEFIPS STATE  zipcode  agi_stub      N1   ...  A85300  N11901  A11901  N11902   A11902
0          1    AL        0         1  815440   ...       0   63420   51444  711580  1831661
1          1    AL        0         2  495830   ...       0   74090  110889  416090  1173463
2          1    AL        0         3  263390   ...       0   64000  143060  195130   543284
3          1    AL        0         4  167190   ...       0   45020  128920  117410   381329
4          1    AL        0         5  217440   ...      19   82940  423629  126130   506526
[5 rows x 147 columns]
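By default, empty fields and strings such as "NA" or "NaN" are read as missing. A minimal sketch with made-up data:

import io
import pandas as pd

# Hypothetical CSV with an empty field and the literal string NA
csv_text = """zipcode,N1,A11902
35004,815440,
35005,NA,1831661
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.isna().sum())  # one missing value in N1 and one in A11902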
Use the na_values keyword argument to set custom missing values.
na_values accepts a single value, a list, or a dictionary of columns and values.

tax_data = pd.read_csv("us_tax_data_2016.csv", na_values={"zipcode": 0})
print(tax_data[tax_data.zipcode.isna()])
        STATEFIPS STATE  zipcode  agi_stub       N1   ...      A85300  N11901    A11901   N11902   A11902
0               1    AL      NaN         1   815440   ...           0   63420     51444   711580  1831661
1               1    AL      NaN         2   495830   ...           0   74090    110889   416090  1173463
2               1    AL      NaN         3   263390   ...           0   64000    143060   195130   543284
...           ...   ...      ...       ...      ...   ...         ...     ...       ...      ...      ...
179034         56    WY      NaN         5    38030   ...         121   13230     73326    22250    99589
179035         56    WY      NaN         6     8480   ...       53835    3630    128149     2250   125557
[306 rows x 147 columns]
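A minimal sketch with made-up data showing the three forms na_values can take:

import io
import pandas as pd

csv_text = """zipcode,N1
0,815440
35005,missing
"""

# Single value: treat 0 as missing in every column
df1 = pd.read_csv(io.StringIO(csv_text), na_values=0)

# List: treat both 0 and "missing" as missing in every column
df2 = pd.read_csv(io.StringIO(csv_text), na_values=[0, "missing"])

# Dictionary: treat 0 as missing only in the zipcode column
df3 = pd.read_csv(io.StringIO(csv_text), na_values={"zipcode": 0})
print(df3["zipcode"].isna().sum())  # 1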
Sample of us_tax_data_2016_corrupt.csv
STATEFIPS,STATE,zipcode,agi_stub,...,N11901,A11901,N11902,A11902
1,AL,0,1,...,63420,51444,711580,1831661
1,AL,0, ,2,...,74090,110889,416090,1173463
tax_data = pd.read_csv("us_tax_data_2016_corrupt.csv")
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
    data = pd.read_csv('us_tax_data_2016_corrupt.csv')
  File "<stdin>", line 697, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "<stdin>", line 430, in _read
    data = parser.read(nrows)
  File "<stdin>", line 1134, in read
    ret = self._engine.read(nrows)
  File "<stdin>", line 1990, in read
    data = self._reader.read(nrows)
  File "<stdin>", line 899, in pandas._libs.parsers.TextReader.read
  File "<stdin>", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "<stdin>", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "<stdin>", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
  File "<stdin>", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 147 fields in line 3, saw 148
Set error_bad_lines=False to skip unparseable records.
Set warn_bad_lines=True to see messages when records are skipped.

tax_data = pd.read_csv("us_tax_data_2016_corrupt.csv",
                       error_bad_lines=False,
                       warn_bad_lines=True)
b'Skipping line 3: expected 147 fields, saw 148\n'
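Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in pandas 2.0; the replacement is the on_bad_lines argument. A minimal sketch for newer versions:

import pandas as pd

# on_bad_lines="warn" skips malformed rows and reports them;
# on_bad_lines="skip" drops them silently
tax_data = pd.read_csv("us_tax_data_2016_corrupt.csv", on_bad_lines="warn")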