Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills

df.select(['NO', 'UNITNUMBER', 'CLASS']).show()
+----+----------+-----+
|  NO|UNITNUMBER|CLASS|
+----+----------+-----+
|   1|      null|   SF|
| 156|        A8|   SF|
| 157|       207|   SF|
| 158|      701G|   SF|
| 159|        36|   SF|
Multiple fields are not needed for our analysis
'NO' – auto-generated record number
'UNITNUMBER' – irrelevant data
'CLASS' – all constant

drop(*cols)
*cols – a column name to drop or a list of column names to drop.

# List of columns to drop
cols_to_drop = ['NO', 'UNITNUMBER', 'CLASS']

# Drop the columns
df = df.drop(*cols_to_drop)
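A minimal runnable sketch of the same drop, assuming a local SparkSession and a toy two-row DataFrame (the values below are illustrative, not the course dataset):

# Build a toy DataFrame and drop the unneeded columns
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_toy = spark.createDataFrame(
    [(1, None, 'SF', 143000), (156, 'A8', 'SF', 190000)],
    ['NO', 'UNITNUMBER', 'CLASS', 'SALESCLOSEPRICE'])

cols_to_drop = ['NO', 'UNITNUMBER', 'CLASS']
df_toy = df_toy.drop(*cols_to_drop)
print(df_toy.columns)  # ['SALESCLOSEPRICE']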
where(condition)
condition – a Column of types.BooleanType or a string of SQL expression.
like(other) – SQL LIKE expression; returns a boolean Column based on a SQL LIKE match.
~ – negates a boolean Column

df = df.where(~df['POTENTIALSHORTSALE'].like('Not Disclosed'))
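A self-contained sketch of this filter, assuming a local SparkSession and made-up rows; only the POTENTIALSHORTSALE logic matches the code above:

# Keep rows whose POTENTIALSHORTSALE is NOT like 'Not Disclosed'
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_toy = spark.createDataFrame(
    [('No', 143000), ('Not Disclosed', 190000), ('Yes', 225000)],
    ['POTENTIALSHORTSALE', 'SALESCLOSEPRICE'])

# ~ negates the boolean Column returned by like()
df_toy = df_toy.where(~df_toy['POTENTIALSHORTSALE'].like('Not Disclosed'))
df_toy.show()  # the 'Not Disclosed' row is gone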
Filter data to within three standard deviations (3σ) of the mean (μ)

# Calculate values used for filtering
std_val = df.agg({'SALESCLOSEPRICE': 'stddev'}).collect()[0][0]
mean_val = df.agg({'SALESCLOSEPRICE': 'mean'}).collect()[0][0]

# Create three standard deviation (μ ± 3σ) upper and lower bounds for data
hi_bound = mean_val + (3 * std_val)
low_bound = mean_val - (3 * std_val)

# Use where() to filter the DataFrame between values
df = df.where((df['LISTPRICE'] < hi_bound) & (df['LISTPRICE'] > low_bound))
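The two agg() calls above each run a separate job; a roughly equivalent sketch computes both statistics in one pass with pyspark.sql.functions. This is an alternative idiom, not the course's code, and it assumes the same SALESCLOSEPRICE and LISTPRICE columns:

# Compute mean and stddev of SALESCLOSEPRICE in a single aggregation
from pyspark.sql import functions as F

stats = df.agg(F.mean('SALESCLOSEPRICE').alias('mean'),
               F.stddev('SALESCLOSEPRICE').alias('stddev')).collect()[0]
hi_bound = stats['mean'] + (3 * stats['stddev'])
low_bound = stats['mean'] - (3 * stats['stddev'])

# Keep records between the bounds, as before
df = df.where((df['LISTPRICE'] < hi_bound) & (df['LISTPRICE'] > low_bound))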
DataFrame.dropna()
how: 'any' or 'all'. If 'any', drop a record if it contains any nulls. If 'all', drop a record only if all its values are null.
thresh: int, default None. If specified, drop records that have fewer than thresh non-null values. This overwrites the how parameter.
subset: optional list of column names to consider.

# Drop any records with NULL values
df = df.dropna()

# Drop records only if both LISTPRICE and SALESCLOSEPRICE are NULL
df = df.dropna(how='all', subset=['LISTPRICE', 'SALESCLOSEPRICE'])

# Drop records with fewer than two non-NULL values
df = df.dropna(thresh=2)
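To see how the three parameters interact, here is a small sketch on toy data, assuming a local SparkSession; the rows are invented so the counts are easy to verify:

# dropna() behavior on a toy DataFrame with missing values
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_toy = spark.createDataFrame(
    [(140000, 143000), (None, 190000), (None, None)],
    ['LISTPRICE', 'SALESCLOSEPRICE'])

df_toy.dropna().count()                                                    # 1: any NULL drops the row
df_toy.dropna(how='all', subset=['LISTPRICE', 'SALESCLOSEPRICE']).count()  # 2: only the all-NULL row is dropped
df_toy.dropna(thresh=2).count()                                            # 1: rows need at least 2 non-NULL values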
What is a duplicate?
 dropDuplicates()
# Entire DataFrame
df.dropDuplicates()

# Check only a column list
df.dropDuplicates(['streetaddress'])
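A short sketch of both forms, assuming a local SparkSession; the street addresses and prices are made up:

# Full-row vs column-subset deduplication
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_toy = spark.createDataFrame(
    [('11 Oak St', 143000), ('11 Oak St', 143000), ('11 Oak St', 150000)],
    ['streetaddress', 'SALESCLOSEPRICE'])

df_toy.dropDuplicates().count()                   # 2: the exact duplicate row is removed
df_toy.dropDuplicates(['streetaddress']).count()  # 1: one row kept per street address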