Streamlined Data Ingestion with pandas
Amany Mahfouz
Instructor
parse_dates
argument (not dtype
!)parse_dates
can accept:# List columns of dates to parse date_cols = ["Part1StartTime", "Part1EndTime"]
# Load file, parsing standard datetime columns survey_df = pd.read_excel("fcc_survey.xlsx", parse_dates=date_cols)
# Check data types of timestamp columns
print(survey_df[["Part1StartTime",
"Part1EndTime",
"Part2StartDate",
"Part2StartTime",
"Part2EndTime"]].dtypes)
Part1StartTime datetime64[ns]
Part1EndTime datetime64[ns]
Part2StartDate object
Part2StartTime object
Part2EndTime object
dtype: object
# List columns of dates to parse date_cols = ["Part1StartTime", "Part1EndTime",
[["Part2StartDate", "Part2StartTime"]]]
# Load file, parsing standard and split datetime columns survey_df = pd.read_excel("fcc_survey.xlsx", parse_dates=date_cols)
print(survey_df.head(3))
Part2StartDate_Part2StartTime Age ... SchoolMajor StudentDebtOwe
0 2016-03-29 21:24:57 28.0 ... NaN 20000
1 2016-03-29 21:27:14 22.0 ... NaN NaN
2 2016-03-29 21:27:13 19.0 ... NaN NaN
[3 rows x 98 columns]
# List columns of dates to parse date_cols = {"Part1Start": "Part1StartTime", "Part1End": "Part1EndTime", "Part2Start": ["Part2StartDate", "Part2StartTime"]}
# Load file, parsing standard and split datetime columns survey_df = pd.read_excel("fcc_survey.xlsx", parse_dates=date_cols) print(survey_df.Part2Start.head(3))
0 2016-03-29 21:24:57
1 2016-03-29 21:27:14
2 2016-03-29 21:27:13
Name: Part2Start, dtype: datetime64[ns]
parse_dates
doesn't work with non-standard datetime formatspd.to_datetime()
after loading data if parse_dates
won't workto_datetime()
arguments:format
: string representation of datetime formatCode | Meaning | Example |
---|---|---|
%Y |
Year (4-digit) | 1999 |
%m |
Month (zero-padded) | 03 |
%d |
Day (zero-padded) | 01 |
%H |
Hour (24-hour clock) | 21 |
%M |
Minute (zero-padded) | 09 |
%S |
Second (zero-padded) | 05 |
format_string = "%m%d%Y %H:%M:%S"
survey_df["Part2EndTime"] = pd.to_datetime(survey_df["Part2EndTime"], format=format_string)
print(survey_df.Part2EndTime.head())
0 2016-03-29 21:27:25
1 2016-03-29 21:29:10
2 2016-03-29 21:28:21
3 2016-03-29 21:30:51
4 2016-03-29 21:31:54
Name: Part2EndTime, dtype: datetime64[ns]
Streamlined Data Ingestion with pandas