Read Data in Batches

Introduction to Data Quality with Great Expectations

Davina Moossazadeh

Data Scientist

Kaggle Weather Data

A pandas DataFrame containing the Kaggle Weather Data, with the following columns: "Location", "Date_Time", "Temperature_C", "Humidity_pct", "Precipitation_mm", and "Wind_Speed_kmh". The DataFrame has 87,118 rows.

Batch Definitions

Batch Definition - A configuration for how a Data Asset should be divided for testing

batch_definition = data_asset.add_batch_definition_whole_dataframe(
    name="my_batch_definition"
)
print(batch_definition)
id='69e2a81d-1c28-4d1a-b66e-52cdc1198266' 
name='my_batch_definition' 
partitioner=None
1 https://docs.greatexpectations.io/docs/core/connect_to_data/dataframes/

Batches

Batch - A group of records that validations can be run on

batch = batch_definition.get_batch(
    batch_parameters={"dataframe": dataframe}
)

The Batch object

We can use .head() as with pandas:

print(batch.head())

[Screenshot: table showing the first rows of the Batch]

1 Table adapted from https://www.kaggle.com/datasets/prasad22/weather-data

The Batch object

print(batch.head(fetch_all=True))

[Screenshot: table showing all rows of the Batch]

The Batch object

.columns() lists all DataFrame columns. Unlike the pandas .columns attribute, it is a method, so note the ().

print(batch.columns())
['Location',
 'Date_Time',
 'Temperature_C',
 'Humidity_pct',
 'Precipitation_mm',
 'Wind_Speed_kmh']
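
One use for this list is a quick schema check before any validations run. The sketch below is plain Python and does not call Great Expectations: compare_columns is a hypothetical helper written for illustration, and actual_columns is simply the list printed by batch.columns() above.

```python
def compare_columns(actual, expected):
    """Return (missing, unexpected) column names, preserving input order."""
    missing = [name for name in expected if name not in actual]
    unexpected = [name for name in actual if name not in expected]
    return missing, unexpected


# The list that batch.columns() printed on the slide.
actual_columns = [
    "Location",
    "Date_Time",
    "Temperature_C",
    "Humidity_pct",
    "Precipitation_mm",
    "Wind_Speed_kmh",
]

# Here the expected schema is the same six columns, so nothing is reported.
expected_columns = list(actual_columns)
print(compare_columns(actual_columns, expected_columns))  # ([], [])

# Simulate a Batch that has lost its last column.
print(compare_columns(actual_columns[:-1], expected_columns))  # (['Wind_Speed_kmh'], [])
```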

Cheat sheet

Create Batch Definition from Data Asset:

batch_definition = data_asset.add_batch_definition_whole_dataframe(
    name="my_batch_definition"  # name: str
)

Create Batch from Batch Definition:

batch = batch_definition.get_batch(
    batch_parameters={"dataframe": dataframe}
)

Get Batch DataFrame rows:

batch.head(fetch_all=True)  # fetch_all: bool; True returns every row

Get Batch DataFrame column list:

batch.columns()

Let's practice!