Modifying imports: parsing dates

Streamlined Data Ingestion with pandas

Amany Mahfouz

Instructor

Date and Time Data

  • Dates and times have their own data type and internal representation
  • Datetime values can be translated into string representations
  • Common set of codes to describe datetime string formatting
Streamlined Data Ingestion with pandas

pandas and Datetimes

  • Datetime columns are loaded as objects (strings) by default
  • Specify that columns have datetimes with the parse_dates argument (not dtype!)
  • parse_dates can accept:
    • a list of column names or numbers to parse
    • a list containing lists of columns to combine and parse
    • a dictionary where keys are new column names and values are lists of columns to parse together
Streamlined Data Ingestion with pandas

pandas and Datetimes

Spreadsheet of date and time columns with different formats

Streamlined Data Ingestion with pandas

pandas and Datetimes

Spreadsheet of date and time data with standard-format columns highlighted

Streamlined Data Ingestion with pandas

pandas and Datetimes

Spreadsheet of date and time data, highlighting a date-only column and a time-only column

Streamlined Data Ingestion with pandas

pandas and Datetimes

Spreadsheet of date and time data, highlighting a column with an unusual datetime format

Streamlined Data Ingestion with pandas

Parsing Dates

# List columns of dates to parse
date_cols = ["Part1StartTime", "Part1EndTime"]

# Load file, parsing standard datetime columns survey_df = pd.read_excel("fcc_survey.xlsx", parse_dates=date_cols)
Streamlined Data Ingestion with pandas

Parsing Dates

# Check data types of timestamp columns
print(survey_df[["Part1StartTime", 
                 "Part1EndTime", 
                 "Part2StartDate", 
                 "Part2StartTime", 
                 "Part2EndTime"]].dtypes)
Part1StartTime    datetime64[ns]
Part1EndTime      datetime64[ns]
Part2StartDate            object
Part2StartTime            object
Part2EndTime              object
dtype: object
Streamlined Data Ingestion with pandas

Parsing Dates

# List columns of dates to parse
date_cols = ["Part1StartTime", 
             "Part1EndTime",

[["Part2StartDate", "Part2StartTime"]]]
# Load file, parsing standard and split datetime columns survey_df = pd.read_excel("fcc_survey.xlsx", parse_dates=date_cols)
print(survey_df.head(3))
  Part2StartDate_Part2StartTime   Age       ...    SchoolMajor  StudentDebtOwe
0           2016-03-29 21:24:57  28.0       ...            NaN           20000
1           2016-03-29 21:27:14  22.0       ...            NaN             NaN
2           2016-03-29 21:27:13  19.0       ...            NaN             NaN

[3 rows x 98 columns]
Streamlined Data Ingestion with pandas

Parsing Dates

# List columns of dates to parse
date_cols = {"Part1Start": "Part1StartTime", 
             "Part1End": "Part1EndTime",
             "Part2Start": ["Part2StartDate", 
                             "Part2StartTime"]}


# Load file, parsing standard and split datetime columns survey_df = pd.read_excel("fcc_survey.xlsx", parse_dates=date_cols) print(survey_df.Part2Start.head(3))
0   2016-03-29 21:24:57
1   2016-03-29 21:27:14
2   2016-03-29 21:27:13
Name: Part2Start, dtype: datetime64[ns]
Streamlined Data Ingestion with pandas

Non-Standard Dates

  • parse_dates doesn't work with non-standard datetime formats
  • Use pd.to_datetime() after loading data if parse_dates won't work
  • to_datetime() arguments:
    • Dataframe and column to convert
    • format: string representation of datetime format
Streamlined Data Ingestion with pandas

Datetime Formatting

  • Describe datetime string formatting with codes and characters
  • Refer to strftime.org for the full list
Streamlined Data Ingestion with pandas

Datetime Formatting

Code Meaning Example
%Y Year (4-digit) 1999
%m Month (zero-padded) 03
%d Day (zero-padded) 01
%H Hour (24-hour clock) 21
%M Minute (zero-padded) 09
%S Second (zero-padded) 05
Streamlined Data Ingestion with pandas

Parsing Non-Standard Dates

Spreadsheet of date and time data, highlighting a column with an unusual datetime format

format_string = "%m%d%Y %H:%M:%S"

survey_df["Part2EndTime"] = pd.to_datetime(survey_df["Part2EndTime"], format=format_string)
Streamlined Data Ingestion with pandas

Parsing Non-Standard Dates

print(survey_df.Part2EndTime.head())
0   2016-03-29 21:27:25
1   2016-03-29 21:29:10
2   2016-03-29 21:28:21
3   2016-03-29 21:30:51
4   2016-03-29 21:31:54
Name: Part2EndTime, dtype: datetime64[ns]
Streamlined Data Ingestion with pandas

Let's practice!

Streamlined Data Ingestion with pandas

Preparing Video For Download...