ETL en ELT in Python
Jake Roach
Data Engineer
Data moet goed getransformeerd worden om downstream-gebruikers waarde te bieden

pandas biedt krachtige tools om tabeldata te transformeren
.loc[].to_datetime().loc[] laat je beide dimensies van een DataFrame transformeren
# Keep only non-zero entries
cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, :]
# Remove excess columns
cleaned = raw_stock_data.loc[:, ["timestamps", "open", "close"]]
# Combine into one step
cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, ["timestamps", "open", "close"]]
.iloc[] gebruikt integer-indexering om DataFrames te filteren
cleaned = raw_stock_data.iloc[[0:50], [0, 1, 2]]
Datatypes moeten vaak worden geconverteerd voor downstream-gebruik
.to_datetime()# "timestamps" column currectly looks like: "20230101085731"
# Convert "timestamps" column to type datetime
cleaned["timestamps"] = pd.to_datetime(cleaned["timestamps"], format="%Y%m%d%H%M%S")
Timestamp('2023-01-01 08:57:31')
# "timestamps" column currently looks like: 1681596000011
# Convert "timestamps" column to type datatime
cleaned["timestamps"] = pd.to_datetime(cleaned["timestamps"], unit="ms")
Timestamp('2023-04-15 22:00:00.011000')
Data transformeren brengt risico’s mee:
# Several ways to investigate a DataFrame
cleaned = raw_stock_data.loc[raw_stock_data["open"] > 0, ["timestamps", "open", "close"]]
print(cleaned.head())
# Return smallest and largest records
print(cleaned.nsmallest(10, ["timestamps"]))
print(cleaned.nlargest(10, ["timestamps"]))
ETL en ELT in Python