Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
Here be Monsters
Examples use a specific Spark version (2.3.1)
Check your versions!
# return spark version
spark.version
# return python version
import sys
sys.version_info
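A minimal sketch of how the version check could be automated instead of eyeballed. The helper names `version_tuple` and `matches_expected` are hypothetical, not part of PySpark; the only assumption about Spark is that `spark.version` returns a string such as '2.3.1'.

```python
import sys

EXPECTED_SPARK = "2.3.1"  # the version named on the slide


def version_tuple(version):
    """Turn '2.3.1' into (2, 3, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def matches_expected(running, expected=EXPECTED_SPARK):
    """True when the major and minor numbers of both versions agree."""
    return version_tuple(running)[:2] == version_tuple(expected)[:2]


# Python version as a plain (major, minor, micro) tuple
print(sys.version_info[:3])
```

In a live session this would be called as `matches_expected(spark.version)`. Comparing only major and minor is a design choice for this sketch; compare the full tuple if patch releases matter to you.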
Data is supplied as Parquet
PySpark read methods
# JSON
spark.read.json('example.json')
# CSV or delimited files
spark.read.csv('example.csv')
# Parquet
spark.read.parquet('example.parq')
# Read a parquet file to a PySpark DataFrame
df = spark.read.parquet('example.parq')
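The read methods above can be wrapped in one small dispatcher that picks the reader from the file extension. This is only a sketch: `read_by_extension` and the `READERS` table are hypothetical names, and the function assumes `spark` is an existing SparkSession whose `spark.read` exposes the `json`, `csv`, and `parquet` methods shown above.

```python
import os

# Map file extensions to the name of the matching spark.read method.
# '.parq' is included because the slides use 'example.parq'.
READERS = {
    ".json": "json",
    ".csv": "csv",
    ".parq": "parquet",
    ".parquet": "parquet",
}


def read_by_extension(spark, path, **options):
    """Read `path` with the spark.read method that matches its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise ValueError("No reader registered for extension %r" % ext)
    reader = getattr(spark.read, READERS[ext])
    # Extra keyword arguments are passed straight through to the reader
    return reader(path, **options)
```

With a live session, `df = read_by_extension(spark, 'example.parq')` would be equivalent to the Parquet read above, and CSV options such as `header=True` can be passed as keyword arguments.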