Modifying flat file imports

Streamlined Data Ingestion with pandas

Amany Mahfouz

Instructor

U.S. Tax Data

tax_data = pd.read_csv('us_tax_data_2016.csv')

print(tax_data.shape)

(179796, 147)

Limiting Columns

Choose columns to load with the usecols keyword argument
Accepts a list of column numbers or names, or a function to filter column names

col_names = ['STATEFIPS', 'STATE', 'zipcode', 'agi_stub', 'N1']

col_nums = [0, 1, 2, 3, 4]

# Choose columns to load by name
tax_data_v1 = pd.read_csv('us_tax_data_2016.csv', 
                        usecols=col_names)

# Choose columns to load by number
tax_data_v2 = pd.read_csv('us_tax_data_2016.csv',
                        usecols=col_nums)

print(tax_data_v1.equals(tax_data_v2))

True

Limiting Rows

Limit the number of rows loaded with the nrows keyword argument

tax_data_first1000 = pd.read_csv('us_tax_data_2016.csv', nrows=1000)

print(tax_data_first1000.shape)

(1000, 147)

Limiting Rows

Use nrows and skiprows together to process a file in chunks
skiprows accepts a list of row numbers, a number of rows, or a function to filter rows
Set header=None so pandas knows there are no column names

tax_data_next500 = pd.read_csv('us_tax_data_2016.csv', 
                               nrows=500, 
                               skiprows=1000, 
                               header=None)

Limiting Rows

print(tax_data_next500.head(1))

   0   1      2    3     4     5    6    7     8     9     10   ...    136  137   138  139  140  141  142  143   144   145   146
0    1  AL  35565    4   270     0  250    0   210   790   280  ...   1854  260  1978    0    0    0    0   50   222   210   794

[1 rows x 147 columns]

Assigning Column Names

Supply column names by passing a list to the names argument
The list MUST have a name for every column in your data
If you only need to rename a few columns, do it after the import!

Assigning Column Names

col_names = list(tax_data_first1000)

tax_data_next500 = pd.read_csv('us_tax_data_2016.csv',

                               nrows=500, 
                               skiprows=1000,

                               header=None,

                               names=col_names)

print(tax_data_next500.head(1))

   STATEFIPS STATE  zipcode  agi_stub   ...  N11901  A11901  N11902  A11902
0          1    AL    35565         4   ...      50     222     210     794

[1 rows x 147 columns]

Let's practice!

Streamlined Data Ingestion with pandas