Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
PROS

CONS

Orienting our data directions

DataFrame.join(
    other,     # Other DataFrame to merge
    on=None,   # The keys to join on
    how=None)  # Type of join to perform (default is 'inner')
# Inspect dataframe head
hdf.show(2)
+----------+--------------------+
|        dt|                  nm|
+----------+--------------------+
|2012-01-02|        New Year Day|
|2012-01-16|Martin Luther Kin...|
+----------+--------------------+
only showing top 2 rows
# Specify join condition
cond = [df['OFFMARKETDATE'] == hdf['dt']]
# Join hdf onto df
df = df.join(hdf, on=cond, how='left')
# How many sales occurred on bank holidays?
df.where(~df['nm'].isNull()).count()
0
# Register the dataframe as a temp table
df.createOrReplaceTempView("df")
hdf.createOrReplaceTempView("hdf")
# Write a SQL Statement
sql_df = spark.sql("""
                      SELECT 
                        *
                      FROM df
                      LEFT JOIN hdf
                      ON df.OFFMARKETDATE = hdf.dt
                   """)
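To see what the LEFT JOIN returns without a Spark session, here is a minimal sketch that runs the same join in Python's built-in sqlite3 module. SQLite stands in for Spark SQL here, on the assumption that standard LEFT JOIN semantics carry over. The listing rows, the price column, and the holiday names are hypothetical toy data; only the column names OFFMARKETDATE, dt, and nm come from the slides.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Spark SQL (assumption:
# LEFT JOIN behaves the same way for this simple equality condition).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (OFFMARKETDATE TEXT, price INTEGER)")
conn.execute("CREATE TABLE hdf (dt TEXT, nm TEXT)")

# Hypothetical listings: only the second one went off market on a holiday.
conn.executemany("INSERT INTO df VALUES (?, ?)", [
    ("2012-01-03", 143000),
    ("2012-01-16", 190000),
    ("2012-02-10", 225000),
])

# Hypothetical holiday table with placeholder names.
conn.executemany("INSERT INTO hdf VALUES (?, ?)", [
    ("2012-01-02", "Holiday A"),
    ("2012-01-16", "Holiday B"),
])

# A left join keeps every row of df; nm is NULL (None) where no holiday matches.
joined = conn.execute("""
    SELECT df.OFFMARKETDATE, df.price, hdf.nm
    FROM df
    LEFT JOIN hdf
    ON df.OFFMARKETDATE = hdf.dt
    ORDER BY df.OFFMARKETDATE
""").fetchall()

# Count the listings that matched a holiday, mirroring the nm-is-not-null filter.
holiday_sales = conn.execute("""
    SELECT COUNT(*)
    FROM df
    LEFT JOIN hdf
    ON df.OFFMARKETDATE = hdf.dt
    WHERE hdf.nm IS NOT NULL
""").fetchone()[0]

for row in joined:
    print(row)
print("Sales on holidays:", holiday_sales)
```

All three listings survive the join, and only the one dated 2012-01-16 picks up a holiday name. A count of 0 on real data would mean that no key on the left matched any key on the right.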