Modifying flat file imports

Streamlined Data Ingestion with pandas

Amany Mahfouz

Instructor

U.S. Tax Data

tax_data = pd.read_csv('us_tax_data_2016.csv')

print(tax_data.shape)
(179796, 147)
Streamlined Data Ingestion with pandas

Limiting Columns

  • Choose columns to load with the usecols keyword argument
  • Accepts a list of column numbers or names, or a function to filter column names
col_names = ['STATEFIPS', 'STATE', 'zipcode', 'agi_stub', 'N1']

col_nums = [0, 1, 2, 3, 4]
# Choose columns to load by name tax_data_v1 = pd.read_csv('us_tax_data_2016.csv', usecols=col_names)
# Choose columns to load by number tax_data_v2 = pd.read_csv('us_tax_data_2016.csv', usecols=col_nums)
print(tax_data_v1.equals(tax_data_v2))
True
Streamlined Data Ingestion with pandas

Limiting Rows

  • Limit the number of rows loaded with the nrows keyword argument
tax_data_first1000 = pd.read_csv('us_tax_data_2016.csv', nrows=1000)

print(tax_data_first1000.shape)
(1000, 147)
Streamlined Data Ingestion with pandas

Limiting Rows

  • Use nrows and skiprows together to process a file in chunks
  • skiprows accepts a list of row numbers, a number of rows, or a function to filter rows
  • Set header=None so pandas knows there are no column names
tax_data_next500 = pd.read_csv('us_tax_data_2016.csv', 
                               nrows=500, 
                               skiprows=1000, 
                               header=None)
Streamlined Data Ingestion with pandas

Limiting Rows

print(tax_data_next500.head(1))
   0   1      2    3     4     5    6    7     8     9     10   ...    136  137   138  139  140  141  142  143   144   145   146
0    1  AL  35565    4   270     0  250    0   210   790   280  ...   1854  260  1978    0    0    0    0   50   222   210   794

[1 rows x 147 columns]
Streamlined Data Ingestion with pandas

Assigning Column Names

  • Supply column names by passing a list to the names argument
  • The list MUST have a name for every column in your data
  • If you only need to rename a few columns, do it after the import!
Streamlined Data Ingestion with pandas

Assigning Column Names

col_names = list(tax_data_first1000)

tax_data_next500 = pd.read_csv('us_tax_data_2016.csv',
nrows=500, skiprows=1000,
header=None,
names=col_names) print(tax_data_next500.head(1))
   STATEFIPS STATE  zipcode  agi_stub   ...  N11901  A11901  N11902  A11902
0          1    AL    35565         4   ...      50     222     210     794

[1 rows x 147 columns]
Streamlined Data Ingestion with pandas

Let's practice!

Streamlined Data Ingestion with pandas

Preparing Video For Download...