Introduction to Flat Files

Streamlined Data Ingestion with pandas

Amany Mahfouz

Instructor

pandas

pandas logo

Streamlined Data Ingestion with pandas

Data Frames

  • pandas-specific structure for two-dimensional data

Dataframe of countries, capital cities, and country area in square miles

Streamlined Data Ingestion with pandas

Data Frames

  • pandas-specific structure for two-dimensional data

Dataframe with column labels (Country, Population, and Area (sq. mi.)) highlighted

Streamlined Data Ingestion with pandas

Data Frames

  • pandas-specific structure for two-dimensional data

Dataframe with row labels/indices highlighted

Streamlined Data Ingestion with pandas

Flat Files

  • Simple, easy-to-produce format
  • Data stored as plain text (no formatting)
  • One row per line
  • Values for different fields are separated by a delimiter
  • Most common flat file type: comma-separated values
  • One pandas function to load them all: read_csv()
Streamlined Data Ingestion with pandas

Loading CSVs

  • Sample of us_tax_data_2016.csv
STATEFIPS,STATE,zipcode,agi_stub,...,N11901,A11901,N11902,A11902
1,AL,0,1,...,63420,51444,711580,1831661
import pandas as pd

tax_data = pd.read_csv("us_tax_data_2016.csv")
tax_data.head(4)
   STATEFIPS STATE  zipcode  agi_stub   ...     N11901  A11901  N11902   A11902
0          1    AL        0         1   ...      63420   51444  711580  1831661
1          1    AL        0         2   ...      74090  110889  416090  1173463
2          1    AL        0         3   ...      64000  143060  195130   543284
3          1    AL        0         4   ...      45020  128920  117410   381329

[4 rows x 147 columns]
Streamlined Data Ingestion with pandas

Loading Other Flat Files

  • Specify a different delimiter with sep
  • Sample of us_tax_data_2016.tsv
    STATEFIPS    STATE    zipcode    agi_stub    ...    N11901    A11901    N11902    A11902
    1    AL    0    1    ...    63420    51444    711580    1831661
    
import pandas as pd

tax_data = pd.read_csv("us_tax_data_2016.tsv", sep="\t")
tax_data.head(3)
   STATEFIPS STATE  zipcode  agi_stub   ...     N11901  A11901  N11902   A11902
0          1    AL        0         1   ...      63420   51444  711580  1831661
1          1    AL        0         2   ...      74090  110889  416090  1173463
2          1    AL        0         3   ...      64000  143060  195130   543284

[3 rows x 147 columns]
Streamlined Data Ingestion with pandas

Let's practice!

Streamlined Data Ingestion with pandas

Preparing Video For Download...