Introduction to Data Quality with Great Expectations
Davina Moossazadeh
Data Scientist
Create a Data Context:
import great_expectations as gx
context = gx.get_context()
Create a Data Source from the Data Context:
data_source = context.data_sources.add_pandas(
name: str
)
Create a Data Asset from the Data Source:
data_asset = data_source.add_dataframe_asset(
name: str
)
Create a Batch Definition from the Data Asset:
batch_definition = data_asset. \
add_batch_definition_whole_dataframe(
name: str
)
Create a Batch from the Batch Definition:
batch = batch_definition.get_batch(
batch_parameters={"dataframe": dataframe}
)
Get the rows of the Batch's DataFrame:
batch.head(fetch_all: bool)
Get the list of columns of the Batch's DataFrame:
batch.columns()
Create an Expectation:
gx.expectations.Expect...(...)
Create an Expectation on the number of rows:
expectation = gx.expectations. \
ExpectTableRowCountToEqual(
value: int
)
Validate the Expectation:
validation_results = batch.validate(
expect=expectation
)
Check the results:
validation_results.describe()
validation_results.success
validation_results.result
Shape Expectations:
ExpectTableRowCountToEqual(value: int)
ExpectTableRowCountToBeBetween(
min_value: int, max_value: int
)
ExpectTableColumnCountToEqual(
value: int
)
ExpectTableColumnCountToBeBetween(
min_value: int, max_value: int
)
Column name Expectations:
ExpectTableColumnsToMatchSet(
column_set: set
)
ExpectColumnToExist(column: str)
Create an Expectation Suite:
suite = gx.ExpectationSuite(name: str)
Add an Expectation to the suite:
suite.add_expectation(expectation)
Access the suite's Expectations:
suite.expectations
Validate the Expectation Suite:
validation_results = batch.validate(
expect=suite
)
Check the results:
validation_results.success
validation_results.describe()
Add the Expectation Suite to the Data Context:
context.suites.add(suite)
Create a Validation Definition:
validation_definition = \
gx.ValidationDefinition(
name: str,
data=batch_definition,
suite=suite
)
Run the validation:
validation_results = \
validation_definition.run(
batch_parameters={"dataframe": dataframe}
)
Check the results:
validation_results.success
validation_results.describe()
Add the Validation Definition to the Data Context:
context.validation_definitions.add(
validation_definition
)
Create a Checkpoint:
checkpoint = gx.Checkpoint(
name: str,
validation_definitions: list,
)
Run the Checkpoint:
checkpoint_results = checkpoint.run(
batch_parameters={
"dataframe": dataframe
}
)
Check the Checkpoint results:
checkpoint_results.success
Copy an Expectation:
expectation_copy = expectation.copy()
expectation_copy.id = None
Check whether the Expectation is in the suite:
expectation in suite.expectations
Delete an Expectation:
suite.delete_expectation(expectation)
Update the Expectation's value:
expectation.value = new_value
Save the changes to the Expectation:
expectation.save()
Save the changes to the Expectation Suite:
suite.save()
Add a component to the Data Context:
context.data_sources.add(data_source)
context.suites.add(suite)
context.validation_definitions.add(
validation_definition
)
context.checkpoints.add(checkpoint)
Retrieve a component:
.get(name: str)
List the components:
.all()
Delete a component:
.delete(name: str)
Row-level Expectations:
ExpectColumnValuesToNotBeNull(
column: str
)
ExpectColumnValuesToBeOfType(
column: str, type_: str
)
Aggregate-level Expectations:
ExpectColumnDistinctValuesToEqualSet(
column: str, value_set: set
)
ExpectColumnUniqueValueCountToBeBetween(
column: str,
min_value: int, max_value: int
)
ExpectColumnValuesToBeUnique(column: str)
ExpectColumnMostCommonValueToBeInSet(
column: str, value_set: set
)
Numeric Expectations:
ExpectColumn<METRIC>ToBeBetween(
column: str, min_value: int, max_value: int
)
# <METRIC> in
# {"Mean", "Median", "Stdev", "Sum"}
ExpectColumnValuesToBeBetween(
column: str, min_value: int, max_value: int
)
ExpectColumnValuesToBeIncreasing(column: str)
ExpectColumnValuesToBeDecreasing(column: str)
String Expectations:
ExpectColumnValueLengthsToEqual(
column: str, value: int
)
ExpectColumnValuesToMatchRegex(
column: str, regex: str
)
ExpectColumnValuesToMatchRegexList(
column: str, regex_list: list
)
ExpectColumnValuesToBe{Dateutil,Json}Parseable(
column: str
)
Create a conditional Expectation:
expectation = gx.expectations.Expect...(
expectation_parameters,
...,
condition_parser='pandas',
row_condition: str,
)
Resources:
https://docs.greatexpectations.io/docs/core/introduction
https://docs.greatexpectations.io/docs/reference
Expectation Gallery: