Data integration
Responsible AI Data Management
Maria Prokofieva
Lead ML engineer
What we will cover
Why integration is necessary
Its benefits, and its complications
Steps for data integration
Why have multiple sources?
Comprehensive, detailed view
Safety net
Data diversity and fairness
Explainability, transparency and accountability
Beware of issues
Compromise data quality
Introduce inconsistencies
Amplify biases
Reduce representation
Increase model complexity
Reduce transparency and explainability
Image by Streamline HQ
Step 1. Data source selection
Follow the evaluation steps
Assess the data sources
More balanced and comprehensive dataset
Step 2. Aligning data types
Identify common variables
Standardize names and formats
Normalize numerical data
Consolidate categorical data
Align data granularity
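The alignment steps above can be sketched with pandas. This is a minimal illustration, not a prescribed method: the two sources, their column names, the category mapping, and the choice of min-max scaling are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical source A: hourly sensor readings, speed in km/h
sensors = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2024-01-01 08:00", "2024-01-01 09:00", "2024-01-02 08:00"]
    ),
    "RoadType": ["Hwy", "highway", "Residential"],
    "avg_speed_kmh": [90.0, 80.0, 30.0],
})

# Hypothetical source B: daily reports, speed in mph
reports = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "road_type": ["HIGHWAY", "residential"],
    "avg_speed_mph": [50.0, 20.0],
})

# Standardize names: one naming convention for the common variables
sensors = sensors.rename(columns={"Timestamp": "timestamp", "RoadType": "road_type"})

# Standardize formats: convert mph to km/h so both sources share a unit
reports["avg_speed_kmh"] = reports["avg_speed_mph"] * 1.609344
reports = reports.drop(columns="avg_speed_mph")

# Consolidate categorical data: map spelling variants to one label (assumed mapping)
category_map = {"hwy": "highway", "highway": "highway", "residential": "residential"}
sensors["road_type"] = sensors["road_type"].str.lower().map(category_map)
reports["road_type"] = reports["road_type"].str.lower().map(category_map)

# Align data granularity: aggregate hourly readings to daily means
sensors["date"] = sensors["timestamp"].dt.normalize()
sensors_daily = sensors.groupby(["date", "road_type"], as_index=False)[
    "avg_speed_kmh"
].mean()

# Stack the aligned sources, keeping track of where each row came from
combined = pd.concat(
    [sensors_daily.assign(source="sensors"), reports.assign(source="reports")],
    ignore_index=True,
)

# Normalize numerical data: min-max scaling to the [0, 1] range
speed = combined["avg_speed_kmh"]
combined["avg_speed_scaled"] = (speed - speed.min()) / (speed.max() - speed.min())
```

Keeping a `source` column makes it possible to check later whether one source dominates the combined dataset.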
Step 3. Bias and representation enhancement
Weighting
Domain knowledge
Assign weights to under- or overrepresented groups
Balancing
Equal representation
Over- or undersampling
Algorithmic checks
Gap analysis
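A small sketch of two of the options above, weighting and oversampling, on a toy dataset. The `group` column, the inverse-frequency weighting scheme, and the random oversampling strategy are assumptions chosen for illustration; in practice one would usually pick either weighting or resampling, guided by domain knowledge.

```python
import pandas as pd

# Hypothetical integrated dataset in which group "B" is underrepresented
df = pd.DataFrame({
    "group": ["A"] * 8 + ["B"] * 2,
    "value": range(10),
})

counts = df["group"].value_counts()
n_groups = counts.size

# Weighting: inverse-frequency weights, so every group carries the same
# total weight and the weights sum to the number of rows
weighted = df.assign(weight=len(df) / (n_groups * df["group"].map(counts)))

# Balancing: randomly oversample smaller groups up to the largest group size
target = counts.max()
parts = []
for _, grp in df.groupby("group"):
    if len(grp) < target:
        extra = grp.sample(n=target - len(grp), replace=True, random_state=0)
        grp = pd.concat([grp, extra])
    parts.append(grp)
balanced = pd.concat(parts, ignore_index=True)
```

Oversampling duplicates existing rows rather than adding new information, so it is worth recording in the project documentation whenever it is applied.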
Step 4. Document
Detailed records:
Data integration steps
Decisions made
Detailed metadata:
Data sources
Collection methodology
Applied transformations
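One possible way to keep these records is a small machine-readable log stored next to the dataset. The sketch below assumes Python 3.9+; the class names, field names, and example entries are hypothetical and only mirror the items listed above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SourceMetadata:
    """Metadata for one data source (hypothetical structure)."""
    name: str
    collection_methodology: str
    transformations: list[str] = field(default_factory=list)


@dataclass
class IntegrationRecord:
    """Running record of integration steps and the decisions behind them."""
    steps: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    sources: list[SourceMetadata] = field(default_factory=list)


record = IntegrationRecord()
record.sources.append(
    SourceMetadata(
        name="road_sensors",
        collection_methodology="automated hourly counts",
        transformations=["aggregated hourly readings to daily means"],
    )
)
record.steps.append("Aligned column names across sources")
record.decisions.append("Kept km/h as the common speed unit")

# Serialize so the record can be versioned alongside the data
record_json = json.dumps(asdict(record), indent=2)
```

A plain JSON file like this is easy to diff and review, which supports the transparency and accountability goals of the integration.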
Urban traffic flow project
Select data sources
Identify common features
Develop a Unified Data Model
Images by Streamline HQ
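As an illustration of identifying common features and building a unified data model, the sketch below joins two made-up traffic sources on the columns they share. The source names, columns, and values are assumptions for the example, not the project's actual data.

```python
import pandas as pd

# Hypothetical source 1: road sensor counts
sensors = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 08:00"] * 2),
    "location_id": ["L1", "L2"],
    "vehicle_count": [120, 45],
})

# Hypothetical source 2: GPS-derived speeds
gps = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 08:00"] * 2),
    "location_id": ["L1", "L3"],
    "avg_speed_kmh": [42.0, 55.0],
})

# Identify common features: columns present in both sources
common = sorted(set(sensors.columns) & set(gps.columns))

# Unified data model: outer join keeps rows found in only one source,
# and indicator=True adds a "_merge" column showing where each row came from
unified = sensors.merge(gps, on=common, how="outer", indicator=True)
```

Rows marked `left_only` or `right_only` in `_merge` show locations covered by a single source, which is a useful starting point for checking representation.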
Urban traffic flow project
Use statistical techniques for bias and representation
Apply weighting adjustments
Gap analysis and reweighting
Document
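Gap analysis and reweighting can be sketched as a comparison between the shares observed in the integrated data and reference shares. The area categories, the reference shares, and the post-stratification style of reweighting (reference share divided by sample share) are assumptions for this example.

```python
import pandas as pd

# Hypothetical: area type of each observation in the integrated dataset
observed = pd.Series(
    ["downtown"] * 6 + ["suburb"] * 3 + ["industrial"] * 1, name="area"
)

# Assumed reference shares, e.g. taken from the city's road network
reference_share = pd.Series({"downtown": 0.4, "suburb": 0.4, "industrial": 0.2})

# Gap analysis: positive gap = overrepresented, negative = underrepresented
sample_share = observed.value_counts(normalize=True).reindex(
    reference_share.index, fill_value=0.0
)
gap = sample_share - reference_share

# Reweighting: scale each observation so weighted shares match the reference
# (a group with no observations would need more data, not a weight)
weights_by_area = reference_share / sample_share
weights = observed.map(weights_by_area)

# Check: weighted shares after reweighting
weighted_share = weights.groupby(observed).sum() / weights.sum()
```

Here downtown is overrepresented by 20 percentage points, so its observations receive weights below 1, while suburb and industrial observations are weighted up.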
Let's practice!