Data validation best practices
Responsible AI Data Management
Maria Prokofieva
Lead ML engineer
What we will cover
Subgroup analysis
Missing values
Outlier removal
Correcting data inconsistencies
Feature scaling
Feature encoding
Dimensionality reduction
Subgroup analysis
Missing data
Common in large datasets
Data deletion
Imputation strategies and model-based approaches
Subgroup analysis for validation
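The points on this slide can be sketched with a toy example. This is a minimal illustration, not a prescribed method: the column names `group` and `income` and all values are hypothetical, and per-subgroup median imputation is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "group" is the subgroup column, "income" has gaps.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "income": [40.0, np.nan, 60.0, 100.0, np.nan, np.nan],
})

# Share of missing values per subgroup. A large gap between groups
# suggests that deletion would shrink one group disproportionately.
missing_rate = df["income"].isna().groupby(df["group"]).mean()

# Option 1: deletion (drops every row with a missing income).
deleted = df.dropna(subset=["income"])

# Option 2: median imputation computed within each subgroup.
imputed = df.copy()
imputed["income"] = df.groupby("group")["income"].transform(
    lambda s: s.fillna(s.median())
)

print(missing_rate)
print(deleted["group"].value_counts())
print(imputed)
```

Here deletion keeps two rows from group A but only one from group B, which is the kind of imbalance subgroup analysis is meant to surface.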
Outlier removal
Statistical methods like z-scores and IQR, or robust scaling
Validate for fair treatment across data segments
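As a rough sketch of the IQR rule and the fairness check, the example below flags outliers and then compares the flag rate per segment. The data, the column names `group` and `value`, and the conventional 1.5 multiplier are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data with two segments.
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "value": [10.0, 11.0, 12.0, 13.0, 14.0, 10.0, 12.0, 14.0, 50.0, 90.0],
})

# IQR rule: flag points outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["iqr_outlier"] = ~df["value"].between(lower, upper)

# z-scores for comparison (pandas uses the sample standard deviation).
df["z_score"] = (df["value"] - df["value"].mean()) / df["value"].std()

# Fairness check: what share of each segment would be removed?
flag_rate = df["iqr_outlier"].groupby(df["group"]).mean()

print(df)
print(flag_rate)
```

In this toy data every flagged point belongs to segment B, so removing outliers blindly would alter one segment only. That is worth reviewing before any rows are dropped.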
Data inconsistencies
Data quality affects model integrity and reliability
Data standardization and validation rules
Subgroup normalization
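One way to picture standardization and validation rules is shown below. The columns `country` and `age`, the canonical label mapping, and the 0 to 120 age range are all hypothetical choices made for this sketch.

```python
import pandas as pd

# Hypothetical toy data with inconsistent labels and implausible ages.
df = pd.DataFrame({
    "country": [" US", "us", "U.S.", "Canada", "canada "],
    "age": [34, -1, 51, 29, 140],
})

# Standardization: trim whitespace, lowercase, drop periods,
# then map each cleaned label to one canonical code.
canonical = {"us": "US", "canada": "CA"}
cleaned = df["country"].str.strip().str.lower().str.replace(".", "", regex=False)
df["country_std"] = cleaned.map(canonical)

# Validation rule (assumed for illustration): age must lie in [0, 120].
df["age_valid"] = df["age"].between(0, 120)

# Share of rule violations per subgroup.
violation_rate = (~df["age_valid"]).groupby(df["country_std"]).mean()

print(df)
print(violation_rate)
```

Comparing the violation rate across subgroups shows whether data quality problems are concentrated in one segment.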
Feature scaling
Transform input features to a comparable scale
Validate by checking the distributions among groups
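A minimal sketch of this check, assuming a hypothetical `income` feature and `group` column: standardize the feature over the whole dataset, then compare the scaled distribution within each group.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "income": [30.0, 40.0, 50.0, 80.0, 90.0, 100.0],
})

# Standardization (z-score scaling) using the population standard deviation.
mean = df["income"].mean()
std = df["income"].std(ddof=0)
df["income_scaled"] = (df["income"] - mean) / std

# Validate: summarize the scaled feature within each group.
summary = df.groupby("group")["income_scaled"].agg(["mean", "std"])

print(summary)
```

The scaled feature has mean 0 overall, yet group A sits entirely below zero and group B entirely above it. Global scaling does not remove differences between groups, so the per-group summary is what reveals them.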
Feature encoding
Assess encoding effect on outcomes
Check for biases and information loss
Check for overfitting
Employ regularization and dimensionality reduction
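The sketch below illustrates one way information loss can fall unevenly on subgroups. It one-hot encodes a hypothetical `occupation` column, then collapses rare categories into `"other"`, which is a common way to limit the number of encoded columns. The data and the `min_count` threshold are assumptions.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B"],
    "occupation": ["nurse", "nurse", "teacher", "teacher", "farmer", "welder"],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["occupation"], prefix="occ")

# Collapse categories seen fewer than min_count times into "other".
min_count = 2
counts = df["occupation"].value_counts()
rare = counts[counts < min_count].index
df["occupation_collapsed"] = df["occupation"].where(
    ~df["occupation"].isin(rare), "other"
)

# Information-loss check: share of each group that became "other".
collapsed_share = (df["occupation_collapsed"] == "other").groupby(df["group"]).mean()

print(one_hot)
print(collapsed_share)
```

In this toy data every row of group B loses its occupation label while group A is untouched, so the encoding choice removes information from one group only.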
Dimensionality reduction
Reduce input features and preserve essential information
May create bias if the reduced features represent some subgroups poorly
Use techniques like t-SNE together with fairness checks across subgroups
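To illustrate how dimensionality reduction may create bias, the sketch below uses PCA computed with NumPy's SVD rather than t-SNE, because PCA allows a simple reconstruction-error check per group. The data and group labels are hypothetical and deliberately constructed so that the smaller group varies along a different direction.

```python
import numpy as np

# Hypothetical toy data: group A (6 points) varies along the first axis,
# group B (2 points) varies along the second axis.
X = np.array([
    [-3.0, 0.0], [-2.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
    [0.0, 1.0], [0.0, -1.0],
])
groups = np.array(["A"] * 6 + ["B"] * 2)

# PCA via SVD of the centered data, keeping k components.
center = X.mean(axis=0)
Xc = X - center
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 1
components = Vt[:k]

# Project to k dimensions, then map back to the original space.
Z = Xc @ components.T
X_hat = Z @ components + center

# Squared reconstruction error per row, averaged within each group.
err = ((X - X_hat) ** 2).sum(axis=1)
err_by_group = {g: float(err[groups == g].mean()) for g in ("A", "B")}

print(err_by_group)
```

The single retained component follows the majority group, so group A is reconstructed almost exactly while group B loses all of its variation. A gap like this in per-group error is one signal that the reduced features serve some groups better than others.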
Financial advisor
"Annual income" and "Investment frequency" features
Adjust outliers and scale
Subgroup analysis
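The financial advisor example could be sketched as follows. Everything here is an assumption for illustration: the column names `annual_income` and `investment_frequency`, the `age_band` subgroup column, the values, and the choice to clip at the IQR fences and then min-max scale.

```python
import pandas as pd

# Hypothetical toy data for the financial advisor scenario.
df = pd.DataFrame({
    "age_band": ["under_40"] * 4 + ["40_plus"] * 4,
    "annual_income": [30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 1000.0],
    "investment_frequency": [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 12.0],
})

features = ["annual_income", "investment_frequency"]
for col in features:
    # Adjust outliers: clip values to the IQR fences instead of dropping rows.
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    clipped = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    # Scale: min-max scaling of the clipped values to [0, 1].
    df[col + "_scaled"] = (clipped - clipped.min()) / (clipped.max() - clipped.min())

# Subgroup analysis: compare the scaled features across age bands.
scaled_cols = [col + "_scaled" for col in features]
by_band = df.groupby("age_band")[scaled_cols].mean()

print(by_band)
```

Clipping keeps the extreme income row in the data while limiting its influence on the scale, and the per-band means show how the transformed features differ between subgroups.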
Let's practice!