Handling errors and missing data

Streamlined Data Ingestion with pandas

Amany Mahfouz

Instructor

Common Flat File Import Issues

  • Column data types are wrong
  • Values are missing
  • Records that cannot be read by pandas
Streamlined Data Ingestion with pandas

Specifying Data Types

  • pandas automatically infers column data types
print(tax_data.dtypes)
STATEFIPS      int64
STATE         object
zipcode        int64
agi_stub       int64
N1             int64
               ...
N11902         int64
A11902         int64
Length: 147, dtype: object
Streamlined Data Ingestion with pandas

Specifying Data Types

  • Use the dtype keyword argument to specify column data types
  • dtype takes a dictionary of column names and data types
tax_data = pd.read_csv("us_tax_data_2016.csv", dtype={"zipcode": str})

print(tax_data.dtypes)
STATEFIPS      int64
STATE         object
zipcode       object
agi_stub       int64
N1             int64
               ...
N11902         int64
A11902         int64
Length: 147, dtype: object
Streamlined Data Ingestion with pandas

Customizing Missing Data Values

  • pandas automatically interprets some values as missing or NA
print(tax_data.head())
   STATEFIPS STATE  zipcode  agi_stub      N1   ...  A85300  N11901  A11901  N11902   A11902
0          1    AL        0         1  815440   ...       0   63420   51444  711580  1831661
1          1    AL        0         2  495830   ...       0   74090  110889  416090  1173463
2          1    AL        0         3  263390   ...       0   64000  143060  195130   543284
3          1    AL        0         4  167190   ...       0   45020  128920  117410   381329
4          1    AL        0         5  217440   ...      19   82940  423629  126130   506526

[5 rows x 147 columns]
Streamlined Data Ingestion with pandas

Customizing Missing Data Values

  • Use the na_values keyword argument to set custom missing values
  • Can pass a single value, list, or dictionary of columns and values
tax_data = pd.read_csv("us_tax_data_2016.csv", 
                       na_values={"zipcode" : 0})

print(tax_data[tax_data.zipcode.isna()])
        STATEFIPS STATE  zipcode  agi_stub       N1   ...      A85300  N11901    A11901   N11902   A11902
0               1    AL      NaN         1   815440   ...           0   63420     51444   711580  1831661
1               1    AL      NaN         2   495830   ...           0   74090    110889   416090  1173463
2               1    AL      NaN         3   263390   ...           0   64000    143060   195130   543284
...           ...   ...      ...       ...      ...   ...         ...     ...       ...      ...      ...
179034         56    WY      NaN         5    38030   ...         121   13230     73326    22250    99589
179035         56    WY      NaN         6     8480   ...       53835    3630    128149     2250   125557

[306 rows x 147 columns]
Streamlined Data Ingestion with pandas

Lines with Errors

Sample of us_tax_data_2016_corrupt.csv

STATEFIPS,STATE,zipcode,agi_stub,...,N11901,A11901,N11902,A11902
1,AL,0,1,...,63420,51444,711580,1831661
1,AL,0, ,2,...,74090,110889,416090,1173463
tax_data = pd.read_csv("us_tax_data_2016_corrupt.csv")
Streamlined Data Ingestion with pandas
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
    data = pd.read_csv('us_tax_data_2016_corrupt.csv')
  File "<stdin>", line 697, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "<stdin>", line 430, in _read
    data = parser.read(nrows)
  File "<stdin>", line 1134, in read
    ret = self._engine.read(nrows)
  File "<stdin>", line 1990, in read
    data = self._reader.read(nrows)
  File "<stdin>", line 899, in pandas._libs.parsers.TextReader.read
  File "<stdin>", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "<stdin>", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "<stdin>", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
  File "<stdin>", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 147 fields in line 3, saw 148
Streamlined Data Ingestion with pandas

Lines with Errors

  • Set error_bad_lines=False to skip unparseable records
  • Set warn_bad_lines=True to see messages when records are skipped
tax_data = pd.read_csv("us_tax_data_2016_corrupt.csv",
                       error_bad_lines=False,
                       warn_bad_lines=True)
b'Skipping line 3: expected 147 fields, saw 148\n'
Streamlined Data Ingestion with pandas

Let's practice!

Streamlined Data Ingestion with pandas

Preparing Video For Download...