Feature Engineering with PySpark
John Hogue
Lead Data Scientist, General Mills
df.select(['NO', 'UNITNUMBER', 'CLASS']).show()
+----+----------+-----+
|  NO|UNITNUMBER|CLASS|
+----+----------+-----+
|   1|      null|   SF|
| 156|        A8|   SF|
| 157|       207|   SF|
| 158|      701G|   SF|
| 159|        36|   SF|
+----+----------+-----+
Multiple fields are not needed for our analysis:
'NO': auto-generated record number
'UNITNUMBER': irrelevant data
'CLASS': all constant

drop(*cols)
*cols – a column name to drop or a list of column names to drop.

# List of columns to drop
cols_to_drop = ['NO', 'UNITNUMBER', 'CLASS']
# Drop the columns
df = df.drop(*cols_to_drop)
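As a minimal, self-contained sketch of drop(*cols), the toy DataFrame below is illustrative only (the values and the 'toy' name are not from the course data); it shows the list being unpacked into drop() and the remaining columns being checked.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data with the same throwaway columns as above (values are made up)
toy = spark.createDataFrame(
    [(1, None, 'SF', 200000), (156, 'A8', 'SF', 350000)],
    ['NO', 'UNITNUMBER', 'CLASS', 'LISTPRICE'])

cols_to_drop = ['NO', 'UNITNUMBER', 'CLASS']
toy = toy.drop(*cols_to_drop)  # * unpacks the list into separate column names
print(toy.columns)             # ['LISTPRICE']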
where(condition)
condition – a Column of types.BooleanType or a string of SQL expression.
like(other) – SQL LIKE pattern match, returns a Boolean Column.
~ – NOT operator, negates a Boolean Column.

df = df.where(~df['POTENTIALSHORTSALE'].like('Not Disclosed'))
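Because where() also accepts a string of SQL expression, the same filter can be written as plain SQL; a minimal sketch, assuming the same df and POTENTIALSHORTSALE column as above:

df = df.where("POTENTIALSHORTSALE NOT LIKE 'Not Disclosed'")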
Filter data to within three standard deviations (3σ) of the mean (μ)
# Calculate values used for filtering
std_val = df.agg({'SALESCLOSEPRICE': 'stddev'}).collect()[0][0]
mean_val = df.agg({'SALESCLOSEPRICE': 'mean'}).collect()[0][0]

# Create three standard deviation (μ ± 3σ) upper and lower bounds for data
hi_bound = mean_val + (3 * std_val)
low_bound = mean_val - (3 * std_val)
# Use where() to filter the DataFrame between values
df = df.where((df['SALESCLOSEPRICE'] < hi_bound) & (df['SALESCLOSEPRICE'] > low_bound))
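An equivalent sketch using pyspark.sql.functions computes both statistics in a single aggregation pass; the column name is the same as above, and note that between() includes the bounds whereas the strict comparisons above exclude them:

from pyspark.sql import functions as F

# One pass over the data for both statistics
stats = df.agg(F.mean('SALESCLOSEPRICE').alias('mu'),
               F.stddev('SALESCLOSEPRICE').alias('sigma')).collect()[0]
hi_bound = stats['mu'] + 3 * stats['sigma']
low_bound = stats['mu'] - 3 * stats['sigma']
# between() keeps rows with low_bound <= value <= hi_bound
df = df.where(df['SALESCLOSEPRICE'].between(low_bound, hi_bound))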
DataFrame.dropna()
how: 'any' or 'all'. If 'any', drop a record if it contains any nulls. If 'all', drop a record only if all its values are null.
thresh: int, default None. If specified, drop records that have fewer than thresh non-null values. This overwrites the how parameter.
subset: optional list of column names to consider.

# Drop any records with NULL values
df = df.dropna()
# Drop records only if both LISTPRICE and SALESCLOSEPRICE are NULL
df = df.dropna(how='all', subset=['LISTPRICE', 'SALESCLOSEPRICE'])
# Drop records with fewer than two non-NULL values
df = df.dropna(thresh=2)
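Before choosing between how, thresh and subset, it helps to see how many NULLs each column actually has; a minimal sketch, assuming the same df:

from pyspark.sql import functions as F

# Count NULLs per column: when() returns NULL for non-matching rows,
# and count() only counts non-NULL values
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns])
null_counts.show()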
What is a duplicate?
dropDuplicates()
# Entire DataFrame
df.dropDuplicates()
# Check only a column list
df.dropDuplicates(['streetaddress'])
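A quick sketch for checking how many records dropDuplicates() actually removes, assuming the same df and the 'streetaddress' column from the example above:

before = df.count()
df = df.dropDuplicates(['streetaddress'])
print(before - df.count(), 'duplicate records removed')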