Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
PROS

CONS

Orienting our data directions

DataFrame.join(
    other,     # Other DataFrame to merge
    on=None,   # The keys to join on
    how=None)  # Type of join to perform (default is 'inner')
# Inspect dataframe head
hdf.show(2)
+----------+--------------------+
| dt| nm|
+----------+--------------------+
|2012-01-02| New Year Day|
|2012-01-16|Martin Luther Kin...|
+----------+--------------------+
only showing top 2 rows
# Specify join condition
cond = [df['OFFMARKETDATE'] == hdf['dt']]
# Left join hdf onto df (how must be passed by keyword after on=)
df = df.join(hdf, on=cond, how='left')
# How many sales occurred on bank holidays?
df.where(~df['nm'].isNull()).count()
0
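To see what the non-null count measures, here is a minimal sketch of left-join semantics in plain Python rather than Spark, so it runs without a cluster. The rows, the `id` column, the holiday names, and the `left_join` helper are all hypothetical; only the `OFFMARKETDATE`, `dt`, and `nm` column names come from the slides.

```python
# Plain-Python sketch of a left join (not Spark); all data below is hypothetical.
sales = [
    {"id": 1, "OFFMARKETDATE": "2012-01-02"},
    {"id": 2, "OFFMARKETDATE": "2012-01-03"},
    {"id": 3, "OFFMARKETDATE": "2012-01-16"},
]
holidays = [
    {"dt": "2012-01-02", "nm": "Holiday A"},
    {"dt": "2012-01-16", "nm": "Holiday B"},
]


def left_join(left, right, left_key, right_key):
    """Keep every left row; attach the matching right columns, or None if no match."""
    # Index the right side by its key so each lookup is a dictionary access.
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    right_cols = list(right[0]) if right else []

    joined = []
    for row in left:
        matches = index.get(row[left_key])
        if matches:
            for match in matches:
                joined.append({**row, **match})
        else:
            # Unmatched left rows survive, with the right columns set to None.
            joined.append({**row, **{col: None for col in right_cols}})
    return joined


joined = left_join(sales, holidays, "OFFMARKETDATE", "dt")
# Equivalent in spirit to df.where(~df['nm'].isNull()).count()
holiday_sales = sum(1 for row in joined if row["nm"] is not None)
print(len(joined), holiday_sales)
```

Every left row is kept, so the count of non-null `nm` values is the number of rows whose key found a match. A count of 0 therefore means no pair of keys compared equal, and one thing worth checking in that case is whether both key columns share the same type and format.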
# Register the dataframe as a temp table
df.createOrReplaceTempView("df")
hdf.createOrReplaceTempView("hdf")
# Write a SQL Statement
sql_df = spark.sql("""
SELECT
*
FROM df
LEFT JOIN hdf
ON df.OFFMARKETDATE = hdf.dt
""")