Data handling techniques

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

What are we trying to parse?

  • Incorrect data
    • Empty rows
    • Commented lines
    • Headers
  • Nested structures
    • Multiple delimiters
  • Non-regular data
    • Differing numbers of columns per row
  • Focused on CSV data
width, height, image

# This is a comment
200    300    affenpinscher;0
600    450    Collie;307    Collie;101
600    449    Japanese_spaniel;23
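
For a quick illustration of the problem, here is a minimal sketch (assuming the sample above is saved as a tab-delimited file named dogdata.csv, a placeholder name): a read with Spark's default options loads the comment line as data and, because the default separator is a comma, does not split the tab-delimited fields apart.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default read: comma separator, no comment or header handling.
# The '# This is a comment' line comes through as an ordinary row,
# and the tab-delimited fields are left unsplit.
df_raw = spark.read.csv('dogdata.csv')
df_raw.show(truncate=False)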

Stanford ImageNet annotations

  • Identifies dog breeds in images
  • Provides list of all identified dogs in image
  • Other metadata (base folder, image size, etc.)

Example rows:

02111277    n02111277_3206    500    375    Newfoundland,110,73,416,298
02108422    n02108422_4375    500    375    bull_mastiff,101,90,214,356    bull_mastiff,282,74,416,370
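
To make the layout concrete, here is a small plain-Python sketch that unpacks one of the rows above. The field names, and reading the four numbers as bounding-box coordinates, are assumptions for illustration rather than anything stated in the annotation files themselves:

# One annotation row: folder id, image name, width, height, then one
# breed entry per identified dog (coordinates assumed to be a bounding box)
row = ('02108422\tn02108422_4375\t500\t375\t'
       'bull_mastiff,101,90,214,356\tbull_mastiff,282,74,416,370')

folder, image, width, height, *dogs = row.split('\t')
for dog in dogs:
    breed, *coords = dog.split(',')
    print(breed, coords)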

Removing blank lines, headers, and comments

Spark's CSV parser:

  • Automatically removes blank lines
  • Can remove comments using an optional argument
df1 = spark.read.csv('datafile.csv.gz', comment='#')
  • Handles header fields
    • Defined via argument
    • Ignored if a schema is defined
df1 = spark.read.csv('datafile.csv.gz', header=True)
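
Putting those options together with a schema (a sketch only; the schema below matches the width/height/image sample from earlier and the file name is just a placeholder):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema for the width / height / image example file
schema = StructType([
    StructField('width', IntegerType(), True),
    StructField('height', IntegerType(), True),
    StructField('image', StringType(), True)
])

# Skip '#' comment lines; header=True still consumes the header row,
# but the column names come from the schema rather than the header
df1 = spark.read.csv('datafile.csv.gz',
                     schema=schema,
                     comment='#',
                     header=True)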

Automatic column creation

Spark will:

  • Automatically create columns in a DataFrame based on the sep argument
    df1 = spark.read.csv('datafile.csv.gz', sep=',')
    
  • Defaults to using ,
  • Can still parse successfully even if the separator does not appear in the string
    df1 = spark.read.csv('datafile.csv.gz', sep='*')
    
  • Stores all of the data in a single column named _c0 by default
  • Allows you to handle nested separators properly (see the sketch below)
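
One way this plays out with the tab-delimited dog data from earlier (a sketch; the file name 'dogdata.csv.gz' and the new column names are placeholders): read with a separator that never occurs so each full line stays in _c0, then split the nested fields yourself.

import pyspark.sql.functions as F

# '*' never appears in the data, so each full line is kept intact in _c0
df1 = spark.read.csv('dogdata.csv.gz', sep='*', comment='#')

# Now split on the real (tab) delimiter and pull fields out of _c0;
# items from index 2 onward are the individual dog entries
cols = F.split(df1['_c0'], '\t')
df1 = df1.withColumn('width', cols.getItem(0)) \
         .withColumn('height', cols.getItem(1)) \
         .withColumn('first_dog', cols.getItem(2))
df1.show(truncate=False)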

Let's practice!
