Data transformeren met pandas

ETL en ELT in Python

Jake Roach

Data Engineer

Data transformeren in een pipeline

Data moet goed getransformeerd worden om downstream-gebruikers waarde te bieden

Een workflow voor datatransformatie.

pandas biedt krachtige tools om tabeldata te transformeren

  • .loc[]
  • .to_datetime()
ETL en ELT in Python

Records filteren met .loc[]

.loc[] laat je beide dimensies van een DataFrame transformeren

# Keep only non-zero entries
cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, :]
# Remove excess columns
cleaned = raw_stock_data.loc[:, ["timestamps", "open", "close"]]
# Combine into one step
cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, ["timestamps", "open", "close"]]

.iloc[] gebruikt integer-indexering om DataFrames te filteren

cleaned = raw_stock_data.iloc[[0:50], [0, 1, 2]]
ETL en ELT in Python

Datatypes wijzigen

Datatypes moeten vaak worden geconverteerd voor downstream-gebruik

  • .to_datetime()
# "timestamps" column currectly looks like: "20230101085731"
# Convert "timestamps" column to type datetime
cleaned["timestamps"] = pd.to_datetime(cleaned["timestamps"], format="%Y%m%d%H%M%S")
Timestamp('2023-01-01 08:57:31')
# "timestamps" column currently looks like: 1681596000011
# Convert "timestamps" column to type datatime
cleaned["timestamps"] = pd.to_datetime(cleaned["timestamps"], unit="ms")
Timestamp('2023-04-15 22:00:00.011000')
ETL en ELT in Python

Transformaties valideren

Data transformeren brengt risico’s mee:

  • Informatie verliezen
  • Foutieve data creëren
# Several ways to investigate a DataFrame
cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, ["timestamps", "open", "close"]]
print(cleaned.head())
# Return smallest and largest records
print(cleaned.nsmallest(10, ["timestamps"]))
print(cleaned.nlargest(10, ["timestamps"]))
ETL en ELT in Python

Laten we oefenen!

ETL en ELT in Python

Preparing Video For Download...