Data validation best practices

Responsible AI Data Management

Maria Prokofieva

Lead ML engineer

What we will cover

  • Subgroup analysis
  • Missing values
  • Outlier removal
  • Correcting data inconsistencies
  • Feature scaling
  • Feature encoding
  • Dimensionality reduction

Subgroup analysis

Step 1: Divide the data into subgroups based on protected characteristics

Step 2: Evaluate each subgroup's statistical distribution and model performance

Step 3: Evaluate model fairness metrics for each subgroup

Step 4: Apply mitigation strategies
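
Steps 1 to 3 can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the records, the group labels "A" and "B", and the helper name subgroup_report are all hypothetical, and the gap in selection rates is only one of several possible fairness metrics.

```python
from collections import defaultdict

# Hypothetical records: (protected group, true label, predicted label)
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 1),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 1), ("B", 0, 0),
]

def subgroup_report(rows):
    """Return size, accuracy and selection rate for each subgroup."""
    by_group = defaultdict(list)
    for group, y_true, y_pred in rows:
        by_group[group].append((y_true, y_pred))
    report = {}
    for group, pairs in by_group.items():
        n = len(pairs)
        report[group] = {
            "n": n,
            "accuracy": sum(t == p for t, p in pairs) / n,
            # Selection rate: share of the group predicted positive
            "selection_rate": sum(p for _, p in pairs) / n,
        }
    return report

report = subgroup_report(records)

# Gap between the highest and lowest selection rate across groups
rates = [r["selection_rate"] for r in report.values()]
parity_gap = max(rates) - min(rates)
```

Here both groups have the same accuracy, yet their selection rates differ, which is the kind of gap Step 4 would aim to reduce.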

Missing data

  • Common in large datasets
  • Data deletion
  • Imputation strategies and model-based approaches
  • Subgroup analysis for validation
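
One way to combine imputation with subgroup validation is sketched below. The rows, the "income" column and both helper names are assumptions made for illustration. Group-wise median imputation is just one simple strategy among the imputation approaches mentioned above.

```python
from statistics import median

# Hypothetical rows; None marks a missing value
rows = [
    {"group": "A", "income": 50.0},
    {"group": "A", "income": None},
    {"group": "A", "income": 70.0},
    {"group": "B", "income": None},
    {"group": "B", "income": None},
    {"group": "B", "income": 30.0},
]

def missing_rate_by_group(data, column):
    """Share of missing values in `column` for each group."""
    totals, missing = {}, {}
    for row in data:
        g = row["group"]
        totals[g] = totals.get(g, 0) + 1
        missing[g] = missing.get(g, 0) + (row[column] is None)
    return {g: missing[g] / totals[g] for g in totals}

def impute_group_median(data, column):
    """Return new rows with missing values replaced by the group's median.

    Assumes every group has at least one observed value.
    """
    observed = {}
    for row in data:
        if row[column] is not None:
            observed.setdefault(row["group"], []).append(row[column])
    medians = {g: median(v) for g, v in observed.items()}
    return [
        {**row, column: medians[row["group"]] if row[column] is None else row[column]}
        for row in data
    ]

rates = missing_rate_by_group(rows, "income")
imputed = impute_group_median(rows, "income")
```

If the missing rate differs a lot between groups, as it does in this toy data, deleting incomplete rows would shrink one group far more than the other.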

Outlier removal

  • Statistical methods like z-scores and IQR, or robust scaling
  • Validate for fair treatment across data segments
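
A small sketch of the IQR rule with a per-group check follows. The data, the group labels and the helper names are hypothetical, and the multiplier k=1.5 is the common convention rather than something fixed by the method.

```python
from statistics import quantiles

# Hypothetical (group, income in thousands) pairs
incomes = [
    ("A", 40), ("A", 42), ("A", 45), ("A", 47), ("A", 50), ("A", 52),
    ("B", 41), ("B", 44), ("B", 48), ("B", 51), ("B", 120), ("B", 130),
]

def iqr_fences(values, k=1.5):
    """Lower and upper outlier fences based on the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def flag_rate_by_group(data, k=1.5):
    """Share of each group's values that fall outside the global fences."""
    low, high = iqr_fences([v for _, v in data], k)
    totals, flagged = {}, {}
    for g, v in data:
        totals[g] = totals.get(g, 0) + 1
        flagged[g] = flagged.get(g, 0) + (v < low or v > high)
    return {g: flagged[g] / totals[g] for g in totals}

flag_rates = flag_rate_by_group(incomes)
```

In this toy data every flagged value belongs to group B, so dropping outliers would remove records from one segment only. That is worth reviewing before deleting anything.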

Data inconsistencies

  • Data quality affects model integrity and reliability
  • Data standardization and validation rules
  • Subgroup normalization
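
Standardization and validation rules might look like the sketch below. The field names, the label mapping and the age range are illustrative assumptions, not rules taken from the course.

```python
# Hypothetical mapping from raw spellings to one canonical label
CANONICAL_STATUS = {
    "full-time": "full_time", "full time": "full_time", "ft": "full_time",
    "part-time": "part_time", "part time": "part_time", "pt": "part_time",
}

def standardize(record):
    """Return a copy of the record with a normalized employment_status."""
    raw = str(record.get("employment_status", "")).strip().lower()
    return {**record, "employment_status": CANONICAL_STATUS.get(raw, "unknown")}

def validate(record):
    """Return a list of rule violations (empty when the record passes)."""
    errors = []
    age = record.get("age")
    if not isinstance(age, (int, float)) or not 0 <= age <= 120:
        errors.append("age out of range")
    if record.get("employment_status") == "unknown":
        errors.append("unrecognized employment status")
    return errors

records = [
    {"age": 34, "employment_status": " Full-Time "},
    {"age": 250, "employment_status": "contractor?"},
]
cleaned = [standardize(r) for r in records]
problems = [validate(r) for r in cleaned]
```

Counting violations per subgroup shows whether inconsistent entries are concentrated in particular segments.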

Feature scaling

  • Feature scaling to transform input features
  • Validate by checking the distributions among groups
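
As a rough sketch, z-score standardization followed by a per-group check could look like this. The values, the group labels and the function names are made up for illustration, and z-scoring is only one of several scaling choices.

```python
from statistics import mean, pstdev

# Hypothetical (group, feature value) pairs
values = [("A", 30.0), ("A", 40.0), ("A", 50.0),
          ("B", 70.0), ("B", 80.0), ("B", 90.0)]

def standardize(data):
    """Z-score the feature using the global mean and standard deviation."""
    nums = [v for _, v in data]
    mu, sigma = mean(nums), pstdev(nums)  # assumes sigma > 0
    return [(g, (v - mu) / sigma) for g, v in data]

def group_means(data):
    """Mean of the feature within each group."""
    grouped = {}
    for g, v in data:
        grouped.setdefault(g, []).append(v)
    return {g: mean(v) for g, v in grouped.items()}

scaled = standardize(values)
means_after = group_means(scaled)
```

Scaling puts the feature on a common scale but does not remove differences between groups: here group A still sits below zero and group B above it.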

Feature encoding

  • Assess encoding effect on outcomes
  • Check for biases and information loss
  • Check for overfitting
  • Employ regularization and dimensionality reduction
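
The sketch below shows one-hot encoding that folds rare categories into a single bucket. This limits the number of columns and reduces overfitting risk, at the cost of some information loss. The function name, the min_count threshold and the "other" label are assumptions for illustration.

```python
from collections import Counter

def one_hot_with_rare_bucket(values, min_count=2):
    """One-hot encode `values`, mapping rare categories to an 'other' column.

    Assumes no real category is itself named 'other'.
    """
    counts = Counter(values)
    kept = sorted(v for v, c in counts.items() if c >= min_count)
    columns = kept + ["other"]
    encoded = []
    for v in values:
        label = v if v in kept else "other"
        encoded.append([int(label == c) for c in columns])
    return columns, encoded

# Hypothetical categorical feature
sectors = ["retail", "retail", "tech", "tech", "tech", "farming"]
columns, encoded = one_hot_with_rare_bucket(sectors)
```

It is worth checking which subgroups end up in the "other" bucket, since a category that is rare overall may be common within one group.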

Dimensionality reduction

  • Reduce input features and preserve essential information
  • May create bias
  • Use fairness-conscious techniques and validate per subgroup; t-SNE can help visualize the result but is not itself fairness-aware
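
To see how reduction may create bias, the sketch below projects two features onto their first principal component and compares the information lost per group. It uses PCA rather than t-SNE because PCA has a simple closed form for two features. The points, the group labels and the function names are hypothetical.

```python
import math

# Hypothetical (group, x, y) points; group A is the majority
points = [
    ("A", 1.0, 1.0), ("A", 2.0, 2.0), ("A", 3.0, 3.0),
    ("A", 4.0, 4.0), ("A", 5.0, 5.0), ("A", 6.0, 6.0),
    ("B", 3.0, 5.0), ("B", 4.0, 2.0),
]

def first_component(data):
    """Mean and unit direction of the first principal axis (2-D closed form)."""
    n = len(data)
    mx = sum(x for _, x, _ in data) / n
    my = sum(y for _, _, y in data) / n
    sxx = sum((x - mx) ** 2 for _, x, _ in data)
    syy = sum((y - my) ** 2 for _, _, y in data)
    sxy = sum((x - mx) * (y - my) for _, x, y in data)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def reconstruction_error_by_group(data):
    """Mean squared distance between each point and its 1-D reconstruction."""
    (mx, my), (ux, uy) = first_component(data)
    sums, counts = {}, {}
    for g, x, y in data:
        dx, dy = x - mx, y - my
        t = dx * ux + dy * uy  # coordinate along the principal axis
        err = (dx - t * ux) ** 2 + (dy - t * uy) ** 2
        sums[g] = sums.get(g, 0.0) + err
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

errors = reconstruction_error_by_group(points)
```

The principal axis follows the majority group, so the minority group loses much more information in the projection.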

Financial advisor

  • "Annual income" and "Investment frequency" features
  • Adjust outliers and scale
  • Subgroup analysis
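
A compact sketch of this case study follows. The client records and group labels are invented. Clipping to IQR fences followed by min-max scaling is one plausible reading of "adjust outliers and scale", not necessarily the approach used in the course.

```python
from statistics import mean, quantiles

# Hypothetical clients
clients = [
    {"group": "A", "annual_income": 40_000, "investment_frequency": 1},
    {"group": "A", "annual_income": 45_000, "investment_frequency": 2},
    {"group": "A", "annual_income": 50_000, "investment_frequency": 2},
    {"group": "A", "annual_income": 55_000, "investment_frequency": 3},
    {"group": "B", "annual_income": 60_000, "investment_frequency": 4},
    {"group": "B", "annual_income": 65_000, "investment_frequency": 4},
    {"group": "B", "annual_income": 70_000, "investment_frequency": 5},
    {"group": "B", "annual_income": 500_000, "investment_frequency": 6},
]

def clip_to_iqr(values, k=1.5):
    """Clip values to the IQR fences instead of dropping them."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, low), high) for v in values]

def min_max(values):
    """Rescale values to the range [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def prepare(rows, column):
    """Clip outliers, then scale one feature."""
    return min_max(clip_to_iqr([r[column] for r in rows]))

income_scaled = prepare(clients, "annual_income")
frequency_scaled = prepare(clients, "investment_frequency")

# Subgroup analysis: compare the scaled income mean per group
group_income_means = {
    g: mean(v for r, v in zip(clients, income_scaled) if r["group"] == g)
    for g in ("A", "B")
}
```

Clipping keeps the high-income client in the data with a capped value, so group B does not lose a record.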

Let's practice!
