Transforming data with pandas

ETL and ELT in Python

Jake Roach

Data Engineer

Transforming data in a pipeline

Data must be properly transformed to ensure value is provided to downstream users

A data transformation workflow.

pandas provides powerful tools to transform tabular data

  • .loc[]
  • .to_datetime()
ETL and ELT in Python

Filtering records with .loc[]

.loc[] allows for both dimensions of a DataFrame to be transformed

# Keep only non-zero entries
cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, :]
# Remove excess columns
cleaned = raw_stock_data.loc[:, ["timestamps", "open", "close"]]
# Combine into one step
cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, ["timestamps", "open", "close"]]

.iloc[] uses integer indexing to filter DataFrames

cleaned = raw_stock_data.iloc[[0:50], [0, 1, 2]]
ETL and ELT in Python

Altering data types

Data types often need to be converted for downstream use cases

  • .to_datetime()
# "timestamps" column currectly looks like: "20230101085731"
# Convert "timestamps" column to type datetime
cleaned["timestamps"] = pd.to_datetime(cleaned["timestamps"], format="%Y%m%d%H%M%S")
Timestamp('2023-01-01 08:57:31')
# "timestamps" column currently looks like: 1681596000011
# Convert "timestamps" column to type datatime
cleaned["timestamps"] = pd.to_datetime(cleaned["timestamps"], unit="ms")
Timestamp('2023-04-15 22:00:00.011000')
ETL and ELT in Python

Validating transformations

Transforming data comes with risks:

  • Losing information
  • Creating faulty data
# Several ways to investigate a DataFrame
cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, ["timestamps", "open", "close"]]
print(cleaned.head())
# Return smallest and largest records
print(cleaned.nsmallest(10, ["timestamps"]))
print(cleaned.nlargest(10, ["timestamps"]))
ETL and ELT in Python

Let's practice!

ETL and ELT in Python

Preparing Video For Download...