Where to Begin

Feature Engineering with PySpark

John Hogue

Lead Data Scientist, General Mills

Diving Straight to Analysis

Here be Monsters

  • Become your own expert
  • Define goals of analysis
  • Research your data
  • Be curious, ask questions

Monsters


The Data Science Process



Spark changes quickly and frequently


Data Formats: Parquet

Data is supplied as Parquet

  • Stored Column-wise
    • Fast to query column subsets
  • Structured, defined schema
    • Fields and Data Types defined
    • Great for messy text data
  • Industry Adopted
    • Good skill to have! 😃

Parquet File Format


Getting the Data to Spark

PySpark read methods

  • PySpark supports many file types!
# JSON
spark.read.json('example.json')
# CSV or delimited files
spark.read.csv('example.csv')
# Parquet
spark.read.parquet('example.parq')
# Read a parquet file to a PySpark DataFrame
df = spark.read.parquet('example.parq')

Let's Practice!

