Databases and quality checks

Demystifying Decision Science

Howard Friedman

Adjunct Professor at Columbia University

Triaging data sources

Not all data is equally useful

  • Some datasets require heavy cleaning or lack meaningful features
  • Others may not be worth the effort at all

Data triaging helps prioritize quickly

  • Assess whether a dataset is viable before investing time and resources

Start with availability

  • Is the data easy to access, or is it behind a paywall?
  • Are there permission or security barriers?
  • Sorting this out early helps avoid wasted effort

triage.png

Key checks

Consider costs

  • Are there licensing or access fees?
  • Will you need to pay for storage or specialized tools?

Evaluate utility

  • Does the dataset contain what you need for your analysis?
  • Check the scope, detail, completeness, and feature relevance

Check update frequency

  • Real-time predictions need real-time or regularly updated data
  • Mismatched update cycles can render data unusable

Assess geographic resolution

  • Does the data align with the spatial level your analysis requires?
  • National-level data won't help if you need zip-code granularity
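
The geographic resolution check can be sketched in a few lines. This is only an illustration: the level names, their ordering, and the function `resolution_is_sufficient` are assumptions made for this example, and it assumes that finer-grained data can be aggregated up to a coarser level.

```python
# Assumed ordering from coarsest to finest; adapt to your own geography.
GEO_LEVELS = ["national", "state", "county", "zip_code"]

def resolution_is_sufficient(available, required):
    """Return True if the available data is at least as fine-grained as required."""
    return GEO_LEVELS.index(available) >= GEO_LEVELS.index(required)

# National-level data cannot support a zip-code analysis.
national_ok = resolution_is_sufficient("national", "zip_code")
# Zip-code data can (under the aggregation assumption) support a state analysis.
zip_ok = resolution_is_sufficient("zip_code", "state")

print(national_ok, zip_ok)
```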

Data quality checks

Data quality matters

  • Bad data can ruin models and lead to bad decisions
  • "Garbage in, garbage out" is a real risk in decision science

GIGO.png

Watch for missingness

  • Percentage of missing values in each column
  • Patterns that suggest deeper data collection issues
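
A minimal sketch of the per-column missingness check, using only the Python standard library. The sample rows, the column names, and the helper `missing_percent` are hypothetical, and `None` is assumed to be the marker for a missing value.

```python
# Hypothetical sample rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "zip_code": "10027"},
    {"age": None, "income": 61000, "zip_code": "10025"},
    {"age": 45, "income": None, "zip_code": None},
    {"age": 29, "income": None, "zip_code": "10031"},
]

def missing_percent(rows):
    """Return the percentage of missing (None) values for each column."""
    if not rows:
        return {}
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    return {
        col: 100.0 * sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }

missing = missing_percent(rows)
for col, pct in missing.items():
    print(f"{col}: {pct:.0f}% missing")
```

A column with a high percentage, or several columns that tend to be missing in the same rows, may be worth raising with whoever collected the data.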

Do basic range checks

  • Are there values that don't make sense for your context?
  • Examples: age = 450 or price = -$10
  • Use checks to flag major issues early
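
A basic range check might look like the sketch below. The limits in `VALID_RANGES`, the sample records, and the function `flag_out_of_range` are assumptions chosen to mirror the examples above; plausible limits depend on your own context.

```python
# Hypothetical plausible ranges; adjust these to your own context.
VALID_RANGES = {"age": (0, 120), "price": (0.0, float("inf"))}

records = [
    {"id": 1, "age": 34, "price": 19.99},
    {"id": 2, "age": 450, "price": 5.00},
    {"id": 3, "age": 27, "price": -10.00},
]

def flag_out_of_range(records, valid_ranges):
    """Return (id, field, value) for every value outside its allowed range."""
    flagged = []
    for rec in records:
        for field, (low, high) in valid_ranges.items():
            value = rec.get(field)
            # Missing values are skipped here; they belong to the missingness check.
            if value is not None and not (low <= value <= high):
                flagged.append((rec["id"], field, value))
    return flagged

flags = flag_out_of_range(records, VALID_RANGES)
for record_id, field, value in flags:
    print(f"record {record_id}: {field} = {value} is out of range")
```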

More checks

Review outliers

  • Use histograms or boxplots to visualize
  • Discuss with stakeholders before excluding anything
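
A boxplot conventionally marks points beyond 1.5 times the interquartile range as potential outliers. The sketch below applies that same rule numerically with the standard library, as a complement to the visual review. The sample values and the function `iqr_outliers` are hypothetical, and the 1.5 multiplier is a convention rather than a rule from this course. The function only flags values; it removes nothing.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one suspicious value.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
outliers = iqr_outliers(values)
print("Flag for review:", outliers)
```

Flagged values are candidates for a conversation with stakeholders, since an extreme value may be a data error or a genuine and important observation.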

Assess timeliness

  • Ask whether your data is recent enough to be relevant
  • Fast-moving contexts need frequently updated data
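
One way to make the timeliness question concrete is to compare the age of the data against a maximum age that the use case can tolerate. In this sketch the dates, the thresholds, and the function `is_fresh` are all assumptions for illustration.

```python
from datetime import date, timedelta

def is_fresh(last_updated, max_age_days, as_of):
    """Return True if the data was updated within max_age_days of as_of."""
    return (as_of - last_updated) <= timedelta(days=max_age_days)

# Hypothetical dates: the data was last refreshed 29 days before the analysis.
as_of = date(2024, 6, 30)
last_updated = date(2024, 6, 1)

fresh_for_monthly = is_fresh(last_updated, max_age_days=31, as_of=as_of)
fresh_for_daily = is_fresh(last_updated, max_age_days=1, as_of=as_of)
print(fresh_for_monthly, fresh_for_daily)
```

The same dataset can pass for a slow-moving monthly report and fail for a fast-moving daily decision.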

Ensure formatting consistency

  • Look for inconsistent date formats, spelling issues, or column naming mismatches
  • Clean and standardize to avoid problems during analysis
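
The sketch below shows one possible way to standardize dates and column names with the standard library. The list of date formats, the sample values, and both helper functions are assumptions. Note that a value such as 03/05/2024 is ambiguous; this example assumes the US month-first reading.

```python
from datetime import datetime

# Assumed date formats; extend to match what appears in your data.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")

def standardize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def standardize_column(name):
    """Lowercase a column name and replace spaces and hyphens with underscores."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")

dates = ["2024-03-05", "03/05/2024", "05.03.2024", "March fifth"]
clean_dates = [standardize_date(d) for d in dates]

columns = ["Order Date", "order-date ", "ORDER_DATE"]
clean_columns = [standardize_column(c) for c in columns]

print(clean_dates)
print(clean_columns)
```

Values that come back as `None` did not match any known format and need manual review rather than silent conversion.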

QA.png

Let's practice!
