Data handling techniques

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

What are we trying to parse?

  • Incorrect data
    • Empty rows
    • Commented lines
    • Headers
  • Nested structures
    • Multiple delimiters
  • Non-regular data
    • Differing numbers of columns per row
  • Focused on CSV data
width, height, image

# This is a comment
200    300    affenpinscher;0
600    450    Collie;307    Collie;101
600    449    Japanese_spaniel;23
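
For a quick illustration of the problem, here is a minimal sketch (assuming the sample above is saved as a tab-delimited file named dogdata.csv, a placeholder name): a read with Spark's default options loads the comment line as data and, because the default separator is a comma, does not split the tab-delimited fields apart.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default read: comma separator, no comment or header handling.
# The '# This is a comment' line comes through as an ordinary row,
# and the tab-delimited fields are left unsplit.
df_raw = spark.read.csv('dogdata.csv')
df_raw.show(truncate=False)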

Stanford ImageNet annotations

  • Identifies dog breeds in images
  • Provides list of all identified dogs in image
  • Other metadata (base folder, image size, etc.)

Example rows:

02111277    n02111277_3206    500    375    Newfoundland,110,73,416,298
02108422    n02108422_4375    500    375    bull_mastiff,101,90,214,356    bull_mastiff,282,74,416,370
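
To make the layout concrete, here is a small plain-Python sketch that unpacks one of the rows above. The field names, and reading the four numbers as bounding-box coordinates, are assumptions for illustration rather than anything stated in the annotation files themselves:

# One annotation row: folder id, image name, width, height, then one
# breed entry per identified dog (coordinates assumed to be a bounding box)
row = ('02108422\tn02108422_4375\t500\t375\t'
       'bull_mastiff,101,90,214,356\tbull_mastiff,282,74,416,370')

folder, image, width, height, *dogs = row.split('\t')
for dog in dogs:
    breed, *coords = dog.split(',')
    print(breed, coords)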

Removing blank lines, headers, and comments

Spark's CSV parser:

  • Automatically removes blank lines
  • Can remove comments using an optional argument
df1 = spark.read.csv('datafile.csv.gz', comment='#')
  • Handles header fields
    • Defined via argument
    • Ignored if a schema is defined
df1 = spark.read.csv('datafile.csv.gz', header=True)
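
Putting those options together with a schema (a sketch only; the schema below matches the width/height/image sample from earlier and the file name is just a placeholder):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema for the width / height / image example file
schema = StructType([
    StructField('width', IntegerType(), True),
    StructField('height', IntegerType(), True),
    StructField('image', StringType(), True)
])

# Skip '#' comment lines; header=True still consumes the header row,
# but the column names come from the schema rather than the header
df1 = spark.read.csv('datafile.csv.gz',
                     schema=schema,
                     comment='#',
                     header=True)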

Automatic column creation

Spark will:

  • Automatically create columns in a DataFrame based on the sep argument
    df1 = spark.read.csv('datafile.csv.gz', sep=',')
    
  • Defaults to using ,
  • Can still parse successfully even if the separator does not appear in the string
    df1 = spark.read.csv('datafile.csv.gz', sep='*')
    
  • Stores all of the data in a single column named _c0 by default
  • Allows you to handle nested separators properly (see the sketch below)
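
One way this plays out with the tab-delimited dog data from earlier (a sketch; the file name 'dogdata.csv.gz' and the new column names are placeholders): read with a separator that never occurs so each full line stays in _c0, then split the nested fields yourself.

import pyspark.sql.functions as F

# '*' never appears in the data, so each full line is kept intact in _c0
df1 = spark.read.csv('dogdata.csv.gz', sep='*', comment='#')

# Now split on the real (tab) delimiter and pull fields out of _c0;
# items from index 2 onward are the individual dog entries
cols = F.split(df1['_c0'], '\t')
df1 = df1.withColumn('width', cols.getItem(0)) \
         .withColumn('height', cols.getItem(1)) \
         .withColumn('first_dog', cols.getItem(2))
df1.show(truncate=False)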

Let's practice!
