Advanced data transformation with pandas

ETL and ELT in Python

Jake Roach

Data Engineer

Advanced data transformation with pandas

ETL pipeline with the transform component highlighted.

ETL and ELT in Python

Filling missing values with pandas

timestamps           volume      open     close             
1997-05-15 13:30:00  1443120000  0.121875  0.097917
1997-05-16 13:30:00   294000000  NaN       0.086458
1997-05-19 13:30:00   122136000  0.088021  NaN
# Fill all NaN with value 0
clean_stock_data = raw_stock_data.fillna(value=0)
timestamps           volume      open     close             
1997-05-15 13:30:00  1443120000  0.121875  0.097917
1997-05-16 13:30:00   294000000  0.000000  0.086458
1997-05-19 13:30:00   122136000  0.088021  0.000000
ETL and ELT in Python

Filling missing values with pandas

timestamps           volume      open     close             
1997-05-15 13:30:00  1443120000  0.121875  0.097917
1997-05-16 13:30:00   294000000  NaN       0.086458
1997-05-19 13:30:00   122136000  0.088021  NaN
# Fill NaN values with specific value for each column
clean_stock_data = raw_stock_data.fillna(value={"open": 0, "close": .5}, axis=1)
timestamps           volume      open     close             
1997-05-15 13:30:00  1443120000  0.121875  0.097917
1997-05-16 13:30:00   294000000  0.000000  0.086458
1997-05-19 13:30:00   122136000  0.088021  0.500000
ETL and ELT in Python

Filling missing values with pandas

timestamps           volume      open     close             
1997-05-15 13:30:00  1443120000  0.121875  0.097917
1997-05-16 13:30:00   294000000  NaN       0.086458
1997-05-19 13:30:00   122136000  0.088021  NaN
# Fill NaN value using other columns
raw_stock_data["open"].fillna(raw_stock_data["close"], inplace=True)
timestamps           volume      open     close             
1997-05-15 13:30:00  1443120000  0.121875  0.097917
1997-05-16 13:30:00   294000000  0.086458  0.086458
1997-05-19 13:30:00   122136000  0.088021  NaN
ETL and ELT in Python

Grouping data

SELECT
    ticker,
    AVG(volume),
    AVG(open),
    AVG(close)
FROM raw_stock_data
GROUP BY ticker;

The .groupby() method can recreate the query above, using pandas

ETL and ELT in Python

Grouping data with pandas

ticker  volume        open        close             
AAPL    1443120000    0.121875    0.097917
AAPL     297000000    0.098146    0.086458
AMZN     124186000    0.247511    0.251290
# Use Python to group data by ticker, find the mean of the reamining columns
grouped_stock_data = raw_stock_data.groupby(by=["ticker"], axis=0).mean()
          volume        open        close
ticker               
AAPL      1.149287e+08    34.998377  34.986851
AMZN      1.434213e+08    30.844692  30.830233

Can use .min(), .max() and .sum() to aggregate data

ETL and ELT in Python

Applying advanced transformations to DataFrames

The .apply() method can handle more advanced transformations

def classify_change(row):
    change = row["close"] - row["open"]
    if change > 0:
        return "Increase"
    else:
        return "Decrease"
# Apply transformation to DataFrame
raw_stock_data["change"] = raw_stock_data.apply(
    classify_change, 
    axis=1
)

Before transformation

ticker  ...  open        close             
AAPL         0.121875    0.097917
AAPL         0.098146    0.086458
AMZN         0.247511    0.251290

$$

After transformation

ticker  ...  open        close       change  
AAPL         0.121875    0.097917    Decrease
AAPL         0.098146    0.086458    Decrease
AMZN         0.247511    0.251290    Increase
ETL and ELT in Python

Let's practice!

ETL and ELT in Python

Preparing Video For Download...