Streamlined Data Ingestion with pandas
Amany Mahfouz
Instructor
import pandas as pd
tax_data = pd.read_csv('us_tax_data_2016.csv')
print(tax_data.shape)
(179796, 147)
Choose columns to load with the usecols keyword argument
col_names = ['STATEFIPS', 'STATE', 'zipcode', 'agi_stub', 'N1']
col_nums = [0, 1, 2, 3, 4]
# Choose columns to load by name
tax_data_v1 = pd.read_csv('us_tax_data_2016.csv', usecols=col_names)
# Choose columns to load by number
tax_data_v2 = pd.read_csv('us_tax_data_2016.csv', usecols=col_nums)
print(tax_data_v1.equals(tax_data_v2))
True
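The usecols behavior above can be tried without the full tax file. The following is a minimal sketch that assumes a small in-memory CSV (the `csv_text` contents are made-up placeholder values, not real tax data). It also shows that `usecols` accepts a function, which pandas calls with each column name and which keeps the column when it returns True.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for the tax file; the values are placeholders
csv_text = (
    "STATEFIPS,STATE,zipcode,agi_stub,N1,MARS1\n"
    "1,AL,0,1,100,10\n"
    "1,AL,0,2,200,20\n"
)

# Select the same three columns by name and by position
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["STATEFIPS", "STATE", "zipcode"])
by_number = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1, 2])

# A function is evaluated against each column name; True keeps the column
by_function = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name.isupper())

print(by_name.equals(by_number))
print(list(by_function.columns))
```

Selecting by name tends to be more robust than selecting by number when the column order of the source file may change.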
Limit the number of rows loaded with the nrows keyword argument
tax_data_first1000 = pd.read_csv('us_tax_data_2016.csv', nrows=1000)
print(tax_data_first1000.shape)
(1000, 147)
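As a quick check of how nrows counts, here is a small sketch on an assumed five-row in-memory file (the `id` and `value` columns are hypothetical). The header line is not counted toward nrows, so `nrows=3` returns three data rows.

```python
import io
import pandas as pd

# Hypothetical five-row file with a header line
csv_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(5))

# nrows counts data rows only; the header line is read in addition
first_three = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(first_three.shape)
```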
Use nrows and skiprows together to process a file in chunks
skiprows accepts a list of row numbers, a number of rows, or a function to filter rows
Set header=None so pandas knows there are no column names
tax_data_next500 = pd.read_csv('us_tax_data_2016.csv',
                               nrows=500,
                               skiprows=1000,
                               header=None)
print(tax_data_next500.head(1))
0 1 2 3 4 5 6 7 8 9 10 ... 136 137 138 139 140 141 142 143 144 145 146
0 1 AL 35565 4 270 0 250 0 210 790 280 ... 1854 260 1978 0 0 0 0 50 222 210 794
[1 rows x 147 columns]
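The three forms that skiprows accepts can be compared side by side. This sketch assumes a small in-memory file with hypothetical `id` and `value` columns. One detail worth checking when lining up chunks: an integer skiprows counts lines from the top of the file, and the header line is one of them.

```python
import io
import pandas as pd

# Hypothetical file: line 0 is the header, lines 1-6 hold ids 0-5
csv_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(6))

# Integer: skip that many lines from the top, header line included
skip_int = pd.read_csv(io.StringIO(csv_text), skiprows=3, header=None)

# List: skip these 0-indexed line numbers; line 0 (the header) is kept
skip_list = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])

# Function: called with each line number; returning True skips that line
skip_func = pd.read_csv(io.StringIO(csv_text),
                        skiprows=lambda n: n > 0 and n % 2 == 0)

print(list(skip_int[0]))
print(list(skip_list["id"]))
print(list(skip_func["id"]))
```

Because the header counts as a skipped line, `skiprows=3` here passes over the header and two data rows, not three data rows.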
Assign column names with the names argument
col_names = list(tax_data_first1000)
tax_data_next500 = pd.read_csv('us_tax_data_2016.csv',
                               nrows=500,
                               skiprows=1000,
                               header=None,
                               names=col_names)
print(tax_data_next500.head(1))
STATEFIPS STATE zipcode agi_stub ... N11901 A11901 N11902 A11902
0 1 AL 35565 4 ... 50 222 210 794
[1 rows x 147 columns]
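Putting nrows, skiprows, header=None, and names together, a whole file can be read in consecutive chunks. The loop below is one possible sketch, not the course's own code: it assumes a small in-memory file, a hypothetical `chunk_size`, and a row count known in advance. Each read skips the header plus the rows already consumed (`start + 1` lines), so chunks do not overlap.

```python
import io
import pandas as pd

# Hypothetical seven-row file with a header line
csv_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(7))
total_rows = 7   # assumed to be known in advance
chunk_size = 3   # hypothetical chunk size

# Read only the header line to recover the column names
col_names = list(pd.read_csv(io.StringIO(csv_text), nrows=0))

chunks = []
for start in range(0, total_rows, chunk_size):
    chunk = pd.read_csv(io.StringIO(csv_text),
                        nrows=chunk_size,
                        skiprows=start + 1,  # header line plus rows already read
                        header=None,
                        names=col_names)
    chunks.append(chunk)

# Reassemble the chunks and compare against a single full read
combined = pd.concat(chunks, ignore_index=True)
full = pd.read_csv(io.StringIO(csv_text))
print(combined.equals(full))
```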