Introduction to spreadsheets

Streamlined Data Ingestion with pandas

Amany Mahfouz

Instructor

Spreadsheets

  • Also known as Excel files
  • Data stored in tabular form, with cells arranged in rows and columns
  • Unlike flat files, can have formatting and formulas
  • Multiple spreadsheets can exist in a workbook
Streamlined Data Ingestion with pandas

Loading Spreadsheets

  • Spreadsheets have their own loading function in pandas: read_excel()

Screenshot of FreeCodeCamp's New Developer Survey data in a spreadsheet program

Streamlined Data Ingestion with pandas

Loading Spreadsheets

import pandas as pd

# Read the Excel file
survey_data = pd.read_excel("fcc_survey.xlsx")

# View the first 5 lines of data print(survey_data.head())
    Age  AttendedBootcamp       ...                    SchoolMajor  StudentDebtOwe
0  28.0               0.0       ...                            NaN           20000
1  22.0               0.0       ...                            NaN             NaN
2  19.0               0.0       ...                            NaN             NaN
3  26.0               0.0       ...        Cinematography And Film            7000
4  20.0               0.0       ...                            NaN             NaN

[5 rows x 98 columns]
Streamlined Data Ingestion with pandas

Loading Select Columns and Rows

Survey data spreadsheet with metadata header rows

Invoice spreadsheet with multiple small tables of data

Streamlined Data Ingestion with pandas

Loading Select Columns and Rows

  • read_excel() has many keyword arguments in common with read_csv()
    • nrows: limit number of rows to load
    • skiprows: specify number of rows or row numbers to skip
    • usecols: choose columns by name, positional number, or letter (e.g. "A:P")
Streamlined Data Ingestion with pandas

Loading Select Columns and Rows

Streamlined Data Ingestion with pandas

Loading Select Columns and Rows

# Read columns W-AB and AR of file, skipping metadata header
survey_data = pd.read_excel("fcc_survey_with_headers.xlsx",
                            skiprows=2,
                            usecols="W:AB, AR")

# View data print(survey_data.head())
   CommuteTime            CountryCitizen  ...    EmploymentFieldOther    EmploymentStatus   Income
0         35.0  United States of America  ...                     NaN  Employed for wages  32000.0
1         90.0  United States of America  ...                     NaN  Employed for wages  15000.0
2         45.0  United States of America  ...                     NaN  Employed for wages  48000.0
3         45.0  United States of America  ...                     NaN  Employed for wages  43000.0
4         10.0  United States of America  ...                     NaN  Employed for wages   6000.0

[5 rows x 7 columns]
Streamlined Data Ingestion with pandas

Let's practice!

Streamlined Data Ingestion with pandas

Preparing Video For Download...