Manually testing a data pipeline

ETL and ELT in Python

Jake Roach

Data Engineer

Testing data pipelines

Data pipelines should be thoroughly tested

  • Validate that data is extracted, transformed, and loaded as expected

Validating pipelines limits maintenance efforts after deployment

  • Identify and fix data quality issues
  • Improve data reliability

Tools and techniques to test data pipelines

  • End-to-end testing
  • Validating data at "checkpoints"
  • Unit testing
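
Of the techniques listed above, unit testing is the easiest to show in a few lines. The following is a minimal sketch, not the course's own code: the `transform` function, its filtering rule, and the sample values are all hypothetical, chosen only to illustrate testing one pipeline step in isolation.

```python
import pandas as pd


def transform(raw_data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform step: keep only rows with a positive open price."""
    clean_data = raw_data.loc[raw_data["open"] > 0, ["volume", "open", "close"]]
    return clean_data.reset_index(drop=True)


def test_transform_drops_non_positive_open() -> None:
    # Small, hand-built input so the expected result is easy to reason about
    raw_data = pd.DataFrame(
        {
            "volume": [100, 200, 300],
            "open": [1.5, 0.0, 2.5],
            "close": [1.4, 0.1, 2.6],
        }
    )

    clean_data = transform(raw_data)

    # The row with open == 0.0 should be gone and the columns unchanged
    assert list(clean_data.columns) == ["volume", "open", "close"]
    assert clean_data.shape == (2, 3)
    assert (clean_data["open"] > 0).all()


# Called directly here; a test runner such as pytest would discover it by name
test_transform_drops_non_positive_open()
```

Keeping each step in its own function, as assumed here, is what makes this kind of isolated test possible.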

Testing and production environments

Test and production environments for building and running data pipelines.

Testing a pipeline end-to-end

End-to-end testing of a data pipeline.

End-to-end testing

  • Confirm that the pipeline runs on repeated attempts
  • Validate data at pipeline checkpoints
  • Engage in peer review, incorporate feedback
  • Ensure consumer access and satisfaction with the solution
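
The first item in the list above, confirming that the pipeline runs on repeated attempts, might be checked along these lines. This is a sketch under stated assumptions: `run_pipeline` and its sample data are hypothetical, and an in-memory SQLite database stands in for the Postgres database used in the course so that the example is self-contained.

```python
import sqlite3

import pandas as pd


def run_pipeline(connection: sqlite3.Connection) -> None:
    """Hypothetical pipeline: extract, transform, and load a tiny data set."""
    raw_data = pd.DataFrame({"open": [1.0, 2.0, -1.0], "close": [1.1, 2.2, 0.5]})
    clean_data = raw_data.loc[raw_data["open"] > 0]
    # Replace the table on every run so repeated runs do not duplicate rows
    clean_data.to_sql("clean_stock_data", con=connection, if_exists="replace", index=False)


# Assumption: SQLite in memory instead of a Postgres engine
connection = sqlite3.connect(":memory:")

run_pipeline(connection)
first_run = pd.read_sql("SELECT * FROM clean_stock_data", con=connection)

run_pipeline(connection)
second_run = pd.read_sql("SELECT * FROM clean_stock_data", con=connection)

# A second run should succeed and leave the same data behind
assert first_run.shape == (2, 2)
assert first_run.equals(second_run)
```

If the load step appended rows instead of replacing the table, the second run would double the row count and the final assertion would fail, which is exactly the kind of problem a repeated run is meant to surface.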

Validating pipeline checkpoints

# Extract, transform, and load data as part of a pipeline
...

# Take a look at the data made available in a Postgres database
loaded_data = pd.read_sql("SELECT * FROM clean_stock_data", con=db_engine)
print(loaded_data.shape)
(6438, 4)
print(loaded_data.head())
         timestamps      volume      open     close
1997-05-15 13:30:00  1443120000  0.121875  0.097917
1997-05-16 13:30:00   294000000  0.098438  0.086458
1997-05-19 13:30:00   122136000  0.088021  0.085417
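
Printing the shape and head is a manual check. The same checkpoint could also be expressed as assertions, roughly as sketched below. The `validate_checkpoint` helper is hypothetical, and the DataFrame is built by hand from the three rows shown above rather than read from a database.

```python
import pandas as pd


def validate_checkpoint(
    data: pd.DataFrame, expected_columns: list[str], min_rows: int = 1
) -> None:
    """Hypothetical helper: raise an AssertionError if a checkpoint looks wrong."""
    assert list(data.columns) == expected_columns, f"Unexpected columns: {list(data.columns)}"
    assert len(data) >= min_rows, f"Expected at least {min_rows} rows, found {len(data)}"
    assert not data.isna().any().any(), "Found missing values"


# Stand-in for the result of pd.read_sql(...), using the rows printed above
loaded_data = pd.DataFrame(
    {
        "timestamps": pd.to_datetime(
            ["1997-05-15 13:30:00", "1997-05-16 13:30:00", "1997-05-19 13:30:00"]
        ),
        "volume": [1443120000, 294000000, 122136000],
        "open": [0.121875, 0.098438, 0.088021],
        "close": [0.097917, 0.086458, 0.085417],
    }
)

validate_checkpoint(loaded_data, ["timestamps", "volume", "open", "close"], min_rows=3)
```

Which properties to assert (column names, row counts, missing values) is a design choice; the three used here are only examples.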

Validating DataFrames

# Extract, transform, and load data as part of a pipeline
...

# Take a look at the data made available in a Postgres database
loaded_data = pd.read_sql("SELECT * FROM clean_stock_data", con=db_engine)

# Compare the two DataFrames.
print(clean_stock_data.equals(loaded_data))
True
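
One caveat with `.equals()` is that it only returns `True` or `False` and requires matching dtypes. As a possible alternative, `pandas.testing.assert_frame_equal` raises an error describing the first difference it finds and can be told to ignore dtypes. The sketch below uses hand-built sample data, and the dtype change is simulated rather than produced by a real database round trip.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hand-built stand-in for the transformed data
clean_stock_data = pd.DataFrame(
    {"volume": [1443120000, 294000000], "open": [0.121875, 0.098438]}
)

# Simulate a load that turned the integer volume column into floats
loaded_data = clean_stock_data.astype({"volume": "float64"})

# .equals() treats the dtype change as a difference
assert not clean_stock_data.equals(loaded_data)

# assert_frame_equal can compare values only; it raises AssertionError on a mismatch
assert_frame_equal(clean_stock_data, loaded_data, check_dtype=False)
```

Whether a dtype change should count as a failure depends on the pipeline, so `check_dtype=False` is a choice to make deliberately rather than a default to rely on.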

Let's practice!
