Common data problems

Introduction to Data Literacy

Jess Ahmet

Content Developer, DataCamp

Dirty data

  • Dirty data is:

    • Incorrect
    • Incomplete
    • Inconsistent
  • Caused by human error, technical issues, or issues with the data collection process

  • Garbage in, garbage out principle: dirty data can lead to wrong conclusions

Dirty window

Data errors

  • Data is incorrect or inconsistent
  • Typically caused by human or technical error in recording the value or the format
  • Techniques to counter:
    • If original value or valid format is known: correct data
    • If unknown: drop data
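
The two techniques above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `records` list, the `KNOWN_VARIANTS` mapping, and the `clean_records` helper are all hypothetical names invented for this example. Values whose valid form is known are corrected; everything else is dropped.

```python
# Hypothetical records: the same country recorded in inconsistent ways.
records = [
    {"id": 1, "country": "Netherlands"},
    {"id": 2, "country": "netherlands "},
    {"id": 3, "country": "NL"},
    {"id": 4, "country": "??"},
]

# Assumed lookup from known (normalized) variants to the valid value.
KNOWN_VARIANTS = {"netherlands": "Netherlands", "nl": "Netherlands"}


def clean_records(rows):
    """Correct rows whose valid value is known; drop the rest."""
    cleaned = []
    for row in rows:
        key = row["country"].strip().lower()
        if key in KNOWN_VARIANTS:
            # Valid format is known: correct the data.
            cleaned.append({**row, "country": KNOWN_VARIANTS[key]})
        # Otherwise the original value is unknown: drop the row.
    return cleaned


cleaned = clean_records(records)
```

Here rows 1 to 3 are kept with a consistent spelling, while row 4 is dropped because nothing indicates what `??` was meant to be.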

Puzzle with the wrong piece

Missing data

  • Data is incomplete
  • Problematic if:
    • Many data points are missing
    • There are underlying patterns in the missing data
  • Techniques to counter:
    • Dropping data
    • Imputation
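
As a rough sketch of both techniques, the example below drops missing values and, alternatively, fills them with the mean of the observed values. Mean imputation is only one of several imputation strategies and is chosen here purely for simplicity. The `ages` list and the helper names are hypothetical.

```python
from statistics import mean

# Hypothetical measurements; None marks a missing value.
ages = [34, None, 29, 41, None, 36]


def missing_share(values):
    """Fraction of values that are missing."""
    return sum(v is None for v in values) / len(values)


def drop_missing(values):
    """Dropping data: keep only the observed values."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Imputation: replace each missing value with the observed mean."""
    fill = mean(drop_missing(values))
    return [fill if v is None else v for v in values]


dropped = drop_missing(ages)
imputed = impute_mean(ages)
```

Checking `missing_share` first is a cheap way to see whether many data points are missing. Neither technique reveals underlying patterns in the missing data, which still need to be investigated separately.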

Puzzle with missing piece

Data bias

  • Societal bias can be reflected in data as data bias
  • Leads to unrepresentative data and therefore unrepresentative results
  • Hard to detect and to resolve
  • Techniques to counter:
    • Sound data collection process
    • Awareness of possible bias when drawing conclusions
    • Explainable AI models
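
One simple check that can support a sound data collection process is to compare how often each group appears in a sample with its share in the population the data is meant to represent. The sketch below assumes those population shares are known. The group labels, the `reference_share` values, and the `representation_gaps` helper are hypothetical, and the function expects a non-empty sample.

```python
from collections import Counter

# Hypothetical sample: group label of each collected data point.
sample_groups = ["A"] * 80 + ["B"] * 20

# Assumed shares of each group in the target population.
reference_share = {"A": 0.5, "B": 0.5}


def representation_gaps(groups, reference):
    """Sample share minus reference share for each reference group.

    Positive values mean the group is over-represented in the sample,
    negative values mean it is under-represented.
    """
    counts = Counter(groups)
    total = len(groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference.items()
    }


gaps = representation_gaps(sample_groups, reference_share)
```

A large gap for any group is a signal that conclusions drawn from the sample may not carry over to the full population. A small gap does not prove the data is free of bias.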

Grey puzzle with white pieces left out

Data cleaning

  • Set of techniques to counter data problems
  • Important preparation step for any data analysis
  • But not all data problems are (completely) solvable
  • Some kind of analysis is always possible, even with imperfect data

Gloved hand with spray bottle

Let's practice!
