Where to Begin

Feature Engineering with PySpark

John Hogue

Lead Data Scientist, General Mills

Diving Straight to Analysis

Here be Monsters

  • Become your own expert
  • Define goals of analysis
  • Research your data
  • Be curious, ask questions



The Data Science Process



Spark changes fast and frequently


Data Formats: Parquet

Data is supplied as Parquet

  • Stored Column-wise
    • Fast to query column subsets
  • Structured, defined schema
    • Fields and Data Types defined
    • Great for messy text data
  • Industry Adopted
    • Good skill to have!

Parquet File Format


Getting the Data to Spark

PySpark read methods

  • PySpark supports many file types!
# JSON
spark.read.json('example.json')
# CSV or delimited files
spark.read.csv('example.csv')
# Parquet
spark.read.parquet('example.parq')
# Read a parquet file to a PySpark DataFrame
df = spark.read.parquet('example.parq')

Let's Practice!

