Introduction to spreadsheets

Streamlined Data Ingestion with pandas

Amany Mahfouz

Instructor

Spreadsheets

Also known as Excel files
Data stored in tabular form, with cells arranged in rows and columns
Unlike flat files, can have formatting and formulas
Multiple spreadsheets can exist in a workbook

Loading Spreadsheets

Spreadsheets have their own loading function in pandas: read_excel()

Screenshot of FreeCodeCamp's New Developer Survey data in a spreadsheet program

Loading Spreadsheets

import pandas as pd

# Read the Excel file
survey_data = pd.read_excel("fcc_survey.xlsx")


# View the first 5 lines of data
print(survey_data.head())

    Age  AttendedBootcamp       ...                    SchoolMajor  StudentDebtOwe
0  28.0               0.0       ...                            NaN           20000
1  22.0               0.0       ...                            NaN             NaN
2  19.0               0.0       ...                            NaN             NaN
3  26.0               0.0       ...        Cinematography And Film            7000
4  20.0               0.0       ...                            NaN             NaN

[5 rows x 98 columns]

Loading Select Columns and Rows

Survey data spreadsheet with metadata header rows

Invoice spreadsheet with multiple small tables of data

Loading Select Columns and Rows

read_excel() has many keyword arguments in common with read_csv()
- nrows: limit number of rows to load
- skiprows: specify number of rows or row numbers to skip
- usecols: choose columns by name, positional number, or letter (e.g. "A:P")

Loading Select Columns and Rows

# Read columns W-AB and AR of file, skipping metadata header
survey_data = pd.read_excel("fcc_survey_with_headers.xlsx",
                            skiprows=2,
                            usecols="W:AB, AR")


# View data
print(survey_data.head())

   CommuteTime            CountryCitizen  ...    EmploymentFieldOther    EmploymentStatus   Income
0         35.0  United States of America  ...                     NaN  Employed for wages  32000.0
1         90.0  United States of America  ...                     NaN  Employed for wages  15000.0
2         45.0  United States of America  ...                     NaN  Employed for wages  48000.0
3         45.0  United States of America  ...                     NaN  Employed for wages  43000.0
4         10.0  United States of America  ...                     NaN  Employed for wages   6000.0

[5 rows x 7 columns]

Let's practice!

Streamlined Data Ingestion with pandas