Data integration

Responsible AI Data Management

Maria Prokofieva

Lead ML engineer

What we will cover

  • Why integration is necessary
  • Its benefits, and its complications
  • Steps for data integration

Why have multiple sources?

  • Comprehensive, detailed view
  • Safety net
  • Data diversity and fairness
  • Explainability, transparency and accountability

Beware of issues

  • Compromise data quality
  • Introduce inconsistencies
  • Amplify biases
  • Reduce representation
  • Increase model complexity
  • Reduce transparency and explainability


1 Image by Streamline HQ

Step 1. Data source selection

  • Follow the evaluation steps
  • Assess the data sources
  • Aim for a more balanced and comprehensive dataset

Step 2. Aligning data types

  • Identify common variables
  • Standardize names and formats
  • Normalize numerical data
  • Consolidate categorical data
  • Align data granularity
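The alignment steps above can be sketched in plain Python. This is a minimal illustration, not a prescribed method: the two sources, their field names (`ts`, `temp_f`, `area`, and so on), and the `REGION_LABELS` mapping are all hypothetical. It covers standardizing names and formats, converting units to a common scale, consolidating category labels, and aggregating to a shared daily granularity; statistical rescaling of numeric values is left out.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: source A uses Fahrenheit and short region codes,
# source B uses Celsius and full region names.
source_a = [
    {"ts": "2024-01-01T08:00:00", "temp_f": 50.0, "region": "N"},
    {"ts": "2024-01-01T20:00:00", "temp_f": 41.0, "region": "N"},
]
source_b = [
    {"time": "2024-01-01 12:00", "temp_c": 9.0, "area": "north"},
]

# Consolidate categorical labels into one agreed vocabulary (assumed mapping).
REGION_LABELS = {"n": "north", "north": "north", "s": "south", "south": "south"}


def from_source_a(row):
    """Rename fields, parse the timestamp, and convert Fahrenheit to Celsius."""
    return {
        "timestamp": datetime.fromisoformat(row["ts"]),
        "temp_c": (row["temp_f"] - 32.0) * 5.0 / 9.0,
        "region": REGION_LABELS[row["region"].lower()],
    }


def from_source_b(row):
    """Rename fields and parse source B's timestamp format."""
    return {
        "timestamp": datetime.strptime(row["time"], "%Y-%m-%d %H:%M"),
        "temp_c": row["temp_c"],
        "region": REGION_LABELS[row["area"].lower()],
    }


def daily_mean(rows):
    """Align granularity: average temperature per (day, region)."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[(r["timestamp"].date(), r["region"])].append(r["temp_c"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


aligned = [from_source_a(r) for r in source_a] + [from_source_b(r) for r in source_b]
daily = daily_mean(aligned)
```

Keeping one small adapter function per source makes each naming and unit decision explicit, which also helps when documenting the integration later.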



Step 3. Bias and representation enhancement

  • Weighting
    • Domain knowledge
    • Assign weights to under- or overrepresented groups
  • Balancing
    • Equal representation
    • Over- or undersampling
  • Algorithmic checks
  • Gap analysis
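As a rough illustration of weighting and balancing, the sketch below computes inverse-frequency weights and performs random oversampling on a toy dataset. The `group` field, the function names, and the choice of inverse-frequency weighting are assumptions made for the example; in practice, domain knowledge should guide which groups to weight and by how much.

```python
import random
from collections import Counter

# Toy dataset: group A is overrepresented relative to group B.
rows = [{"group": "A", "id": i} for i in range(8)] + [
    {"group": "B", "id": i} for i in range(8, 10)
]


def inverse_frequency_weights(rows, key):
    """Weight each group so that every group contributes equally in total."""
    counts = Counter(r[key] for r in rows)
    n_rows, n_groups = len(rows), len(counts)
    return {g: n_rows / (n_groups * c) for g, c in counts.items()}


def oversample(rows, key, seed=0):
    """Randomly duplicate rows of smaller groups up to the largest group's size."""
    rng = random.Random(seed)
    by_group = {}
    for r in rows:
        by_group.setdefault(r[key], []).append(r)
    target = max(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


weights = inverse_frequency_weights(rows, "group")
balanced = oversample(rows, "group")
```

Weighting leaves the data untouched and changes only each row's influence, while oversampling changes the dataset itself by duplicating rows. Either can overcorrect, so the result is worth re-checking afterwards.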



Step 4. Document

  • Detailed records:
    • Data integration steps
    • Decisions made
  • Detailed metadata:
    • Data sources
    • Collection methodology
    • Applied transformations
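One possible way to keep such records is a small machine-readable log stored next to the integrated dataset. The sketch below is only an assumption about how that could look: the class names `SourceMetadata` and `IntegrationLog`, their fields, and the example entries are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SourceMetadata:
    """Metadata for one data source (illustrative fields only)."""
    name: str
    collection_method: str
    transformations: list = field(default_factory=list)


@dataclass
class IntegrationLog:
    """Running record of sources used and decisions made during integration."""
    sources: list = field(default_factory=list)
    decisions: list = field(default_factory=list)

    def add_source(self, source):
        self.sources.append(source)

    def record_decision(self, step, decision, rationale):
        self.decisions.append(
            {"step": step, "decision": decision, "rationale": rationale}
        )

    def to_json(self):
        # asdict recurses into the nested SourceMetadata entries.
        return json.dumps(asdict(self), indent=2)


log = IntegrationLog()
log.add_source(
    SourceMetadata(
        name="source_a",
        collection_method="hourly automated readings",
        transformations=["converted Fahrenheit to Celsius"],
    )
)
log.record_decision(
    step="align data types",
    decision="use Celsius as the common unit",
    rationale="most sources already report Celsius",
)
log_json = log.to_json()
```

Writing `log_json` to a file under version control keeps the integration steps and the reasoning behind them reviewable.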



Urban traffic flow project

  • Select data sources
  • Identify common features
  • Develop a Unified Data Model
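A unified data model for the traffic project might look like the sketch below. The slides do not name the actual sources or fields, so everything here is hypothetical: the `road_sensor` and `camera` sources, their field names, and the `TrafficObservation` schema. The idea is to map each source's version of a common feature onto one shared record type.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TrafficObservation:
    """Shared schema that every source is mapped onto (assumed fields)."""
    timestamp: datetime
    location_id: str
    vehicle_count: int
    source: str


# Common features identified across sources: source field -> unified field.
FIELD_MAPS = {
    "road_sensor": {
        "time": "timestamp",
        "sensor_id": "location_id",
        "count": "vehicle_count",
    },
    "camera": {
        "captured_at": "timestamp",
        "junction": "location_id",
        "vehicles": "vehicle_count",
    },
}


def to_unified(record, source):
    """Keep only the common features and coerce them to the shared types."""
    mapping = FIELD_MAPS[source]
    renamed = {mapping[k]: v for k, v in record.items() if k in mapping}
    return TrafficObservation(
        timestamp=datetime.fromisoformat(renamed["timestamp"]),
        location_id=str(renamed["location_id"]),
        vehicle_count=int(renamed["vehicle_count"]),
        source=source,
    )


sensor_obs = to_unified(
    {"time": "2024-03-01T07:00:00", "sensor_id": 17, "count": "42", "battery": 0.9},
    "road_sensor",
)
camera_obs = to_unified(
    {"captured_at": "2024-03-01T07:00:00", "junction": "J17", "vehicles": 40},
    "camera",
)
```

Source-specific extras such as the `battery` field are dropped here. Whether to keep them in a separate table is a design decision worth documenting.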


1 Images by Streamline HQ

Urban traffic flow project

  • Use statistical techniques to assess bias and representation
  • Apply weighting adjustments
  • Gap analysis and reweighting
  • Document
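A gap analysis followed by reweighting could be sketched as follows. This assumes a set of reference shares is available, for example each district's share of the city's road network. The district names, the shares, and the function names are invented for illustration, and the weights are simple reference-to-observed ratios rather than any method the course specifies.

```python
from collections import Counter


def representation_gaps(labels, reference_shares):
    """Observed share minus reference share per group (positive = overrepresented)."""
    counts = Counter(labels)
    total = len(labels)
    return {
        g: counts.get(g, 0) / total - share for g, share in reference_shares.items()
    }


def reference_weights(labels, reference_shares):
    """Weight = reference share / observed share; None if a group has no data."""
    counts = Counter(labels)
    total = len(labels)
    weights = {}
    for g, share in reference_shares.items():
        if counts.get(g, 0) == 0:
            weights[g] = None  # reweighting cannot fix a missing group
        else:
            weights[g] = share / (counts[g] / total)
    return weights


# Hypothetical traffic observations labelled by district.
districts = ["centre"] * 6 + ["suburb"] * 3 + ["outskirts"] * 1
reference = {"centre": 0.4, "suburb": 0.4, "outskirts": 0.2}

gaps = representation_gaps(districts, reference)
weights = reference_weights(districts, reference)
```

A group that is missing entirely gets `None` instead of a weight, which signals that more data must be collected for it.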

Let's practice!

