Getting More Data

Feature Engineering with PySpark

John Hogue

Lead Data Scientist, General Mills

Thoughts on External Data Sets

PROS

  • Add important predictors
  • Supplement/replace values
  • Cheap or easy to obtain

CONS

  • May 'bog' the analysis down
  • Easy to induce data leakage
  • Requires becoming a subject matter expert on the new data set

About Joins

Orienting our data sets (sketched below)

  • Left: our starting data set
  • Right: the new data set to incorporate
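
In PySpark, the DataFrame you call .join() on sits on the left and the DataFrame you pass in sits on the right. A minimal sketch with hypothetical toy data (left_df, right_df, and the id column are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical starting (left) and new (right) data sets
left_df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'x'])
right_df = spark.createDataFrame([(2, 'B'), (3, 'C')], ['id', 'y'])

# left_df.join(right_df, ...) puts left_df on the left, right_df on the right
left_df.join(right_df, on='id', how='left').show()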

SQL Joins

PySpark DataFrame Joins

DataFrame.join(
    other,     # Other DataFrame to merge
    on=None,   # The keys to join on
    how=None)  # Type of join to perform (default is 'inner')
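
A rough illustration of the how argument, reusing the hypothetical left_df/right_df toy frames from the sketch above (the row counts in the comments assume that toy data):

left_df.join(right_df, on='id').show()               # default 'inner': only id 2 survives
left_df.join(right_df, on='id', how='left').show()   # keep every left row: ids 1 and 2
left_df.join(right_df, on='id', how='outer').show()  # keep all rows from both: ids 1, 2, 3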

PySpark Join Example

# Inspect dataframe head
hdf.show(2)
+----------+--------------------+
|        dt|                  nm|
+----------+--------------------+
|2012-01-02|        New Year Day|
|2012-01-16|Martin Luther Kin...|
+----------+--------------------+
only showing top 2 rows
# Specify the join condition
cond = [df['OFFMARKETDATE'] == hdf['dt']]

# Join hdf onto df
df = df.join(hdf, on=cond, how='left')

# How many sales occurred on bank holidays?
df.where(~df['nm'].isNull()).count()
0
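
As a follow-up sketch, assuming the df and hdf frames from the example above, the joined nm column can be turned into a simple holiday flag and the redundant dt key dropped (the is_holiday name is made up for illustration):

# Flag sales whose off-market date fell on a bank holiday
df = df.withColumn('is_holiday', ~df['nm'].isNull())

# Drop the duplicate join key carried over from hdf
df = df.drop('dt')

df.select('OFFMARKETDATE', 'nm', 'is_holiday').show(2)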

SparkSQL Join

  • Apply SQL to your DataFrames

# Register the DataFrames as temp tables
df.createOrReplaceTempView("df")
hdf.createOrReplaceTempView("hdf")

# Write a SQL statement
sql_df = spark.sql("""
    SELECT *
    FROM df
    LEFT JOIN hdf
    ON df.OFFMARKETDATE = hdf.dt
    """)

Let's Join Some Data!
