Persisting data with pandas

ETL and ELT in Python

Jake Roach

Data Engineer

Persisting data in an ETL pipeline

Loading data to a file:

  • Ensures data consumers have stable access to transformed data
  • Occurs as a final step in an ETL process, as well as between discrete steps
  • Captures a "snapshot" of the data
ETL and ELT in Python

Loading data to CSV files using pandas

.to_csv() method

import pandas as pd

# Data extraction and transformation
raw_data = pd.read_csv("raw_stock_data.csv")
stock_data = raw_data.loc[raw_data["open"] > 100, ["timestamps", "open"]]

# Load data to a .csv file
stock_data.to_csv("stock_data.csv")
  • .to_csv called on the DataFrame
  • Writes DataFrame to path "stock_data.csv"
ETL and ELT in Python

Customizing CSV file output

stock_data.to_csv("./stock_data.csv", header=True)
  • Takes True, False or list of string values
stock_data.to_csv("./stock_data.csv", index=True)
  • Takes True or False
  • Determines whether index column is written to the file
stock_data.to_csv("./stock_data.csv", sep="|")
  • Takes string value used to separate columns in the file
  • The | character is a common option

Has counterparts:

  • .to_parquet()
  • .to_json()
  • .to_sql()
1 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
ETL and ELT in Python

Ensuring data persistence

Was the DataFrame correctly stored to the CSV file?

import pandas
import os  # Import the os module

# Extract, transform and load data
raw_data = pd.read_csv("raw_stock_data.csv")
stock_data = raw_data.loc[raw_data["open"] > 100, ["timestamps", "open"]]
stock_data.to_csv("stock_data.csv")

# Check that the path exists
file_exists = os.path.exists("stock_data.csv")
print(file_exists)
True
ETL and ELT in Python

Let's practice!

ETL and ELT in Python

Preparing Video For Download...