Extracting data from structured sources

ETL and ELT in Python

Jake Roach

Data Engineer

Source systems

In this course:

  • CSV files
  • Parquet files
  • JSON files
  • SQL databases

Data is also sourced from:

  • APIs
  • Data lakes
  • Data warehouses
  • Web scraping
  • ... and so many more!
ETL and ELT in Python

Reading in parquet files

Parquet files:

  • Open source, column-oriented file format designed for efficient field storage and retrieval
  • Similar to working with CSV files
import pandas as pd

# Read the parquet file into memory
raw_stock_data = pd.read_parquet("raw_stock_data.parquet", engine="fastparquet")
1 https://www.databricks.com/glossary/what-is-parquet
ETL and ELT in Python

Connecting to SQL databases

  • Data can be pulled from SQL databases into a pandas DataFrame
  • Requires a connection URI to build an engine, and connect to the database
import sqlalchemy
import pandas as pd

# Connection URI: schema_identifier://username:password@host:port/db
connection_uri = "postgresql+psycopg2://repl:password@localhost:5432/market"
db_engine = sqlalchemy.create_engine(connection_uri)
# Query the SQL database
raw_stock_data = pd.read_sql("SELECT * FROM raw_stock_data LIMIT 10", db_engine)
ETL and ELT in Python

Modularity

Separating logic into functions

  • Increases readability within a pipeline
  • Adheres to the principle "don't repeat yourself"
  • Expedites troubleshooting
def extract_from_sql(connection_uri, query):
    # Create an engine, query data and return DataFrame
    db_engine = sqlalchemy.create_engine(connection_uri)
    return pd.read_sql(query, db_engine)

extract_from_sql("postgresql+psycopg2://.../market", "SELECT ... LIMIT 10;")
ETL and ELT in Python

Let's practice!

ETL and ELT in Python

Preparing Video For Download...