ETL and ELT in Python
Jake Roach
Data Engineer
Data must be properly transformed to ensure value is provided to downstream users
pandas
provides powerful tools to transform tabular data
.loc[]
.to_datetime()
.loc[]
allows for both dimensions of a DataFrame to be transformed
# Keep only non-zero entries
cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, :]
# Remove excess columns
cleaned = raw_stock_data.loc[:, ["timestamps", "open", "close"]]
# Combine into one step
cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, ["timestamps", "open", "close"]]
.iloc[]
uses integer indexing to filter DataFrames
cleaned = raw_stock_data.iloc[[0:50], [0, 1, 2]]
Data types often need to be converted for downstream use cases
.to_datetime()
# "timestamps" column currectly looks like: "20230101085731"
# Convert "timestamps" column to type datetime
cleaned["timestamps"] = pd.to_datetime(cleaned["timestamps"], format="%Y%m%d%H%M%S")
Timestamp('2023-01-01 08:57:31')
# "timestamps" column currently looks like: 1681596000011
# Convert "timestamps" column to type datatime
cleaned["timestamps"] = pd.to_datetime(cleaned["timestamps"], unit="ms")
Timestamp('2023-04-15 22:00:00.011000')
Transforming data comes with risks:
# Several ways to investigate a DataFrame
cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, ["timestamps", "open", "close"]]
print(cleaned.head())
# Return smallest and largest records
print(cleaned.nsmallest(10, ["timestamps"]))
print(cleaned.nlargest(10, ["timestamps"]))
ETL and ELT in Python