Dropping Data

Feature Engineering with PySpark

John Hogue

Lead Data Scientist, General Mills

Where can data go bad?

  • Recorded wrong
  • Unique events
  • Formatted incorrectly
  • Duplications
  • Missing
  • Not relevant


Dropping Columns

df.select(['NO', 'UNITNUMBER', 'CLASS']).show()
+----+----------+-----+
|  NO|UNITNUMBER|CLASS|
+----+----------+-----+
|   1|      null|   SF|
| 156|        A8|   SF|
| 157|       207|   SF|
| 158|      701G|   SF|
| 159|        36|   SF|

Multiple fields are not needed for our analysis

  • 'NO' – an auto-generated record number
  • 'UNITNUMBER' – data irrelevant to the analysis
  • 'CLASS' – constant across all records (see the quick check below)
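Before dropping, a quick sanity check can confirm these assumptions (a minimal sketch, assuming df is the loaded real-estate DataFrame):

# Confirm 'CLASS' really is constant before dropping it
df.select('CLASS').distinct().show()

# Gauge whether 'UNITNUMBER' carries any usable signal
print(df.select('UNITNUMBER').distinct().count())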

Dropping Columns

drop(*cols)

  • *cols – a column name to drop, or a list of column names to drop
  • Returns a new DataFrame with the specified column(s) dropped
# List of columns to drop
cols_to_drop = ['NO', 'UNITNUMBER', 'CLASS']

# Drop the columns
df = df.drop(*cols_to_drop)
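
To double-check the result, you can confirm the dropped names are gone (a small sketch reusing cols_to_drop from above):

# Verify none of the dropped columns remain
assert not set(cols_to_drop) & set(df.columns)
print(df.columns)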

Text Filtering

  • where(condition)
    • condition – a Column of types.BooleanType or a string SQL expression
    • Filters the DataFrame, keeping only rows where the condition is true
  • like(other)
    • other – a SQL LIKE pattern
    • Returns a boolean Column
  • ~
    • The NOT operator; negates a boolean Column
df = df.where(~df['POTENTIALSHORTSALE'].like('Not Disclosed'))
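
Breaking the one-liner above into its parts (a sketch assuming df and the 'POTENTIALSHORTSALE' column shown above): like() builds the boolean Column, ~ negates it, and where() applies it.

# Boolean Column: True where the value matches the SQL LIKE pattern
is_not_disclosed = df['POTENTIALSHORTSALE'].like('Not Disclosed')

# Inspect how many records would be removed
print(df.where(is_not_disclosed).count())

# Keep only records where the value was actually disclosed
df = df.where(~is_not_disclosed)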

Outlier Filtering

Filter data to within three standard deviations (3σ) of the mean (μ)

(Figure: standard normal distribution)


Value Filtering Example

# Calculate values used for filtering
std_val = df.agg({'SALESCLOSEPRICE': 'stddev'}).collect()[0][0]
mean_val = df.agg({'SALESCLOSEPRICE': 'mean'}).collect()[0][0]

# Create three standard deviation (μ ± 3σ) upper and lower bounds for data
hi_bound = mean_val + (3 * std_val)
low_bound = mean_val - (3 * std_val)

# Use where() to filter the DataFrame between values
df = df.where((df['LISTPRICE'] < hi_bound) & (df['LISTPRICE'] > low_bound))
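
The same bounds can also be computed in a single aggregation with pyspark.sql.functions (an alternative sketch, not part of the original example):

from pyspark.sql.functions import mean, stddev

# Compute mean and standard deviation of the sale price in one pass
stats = df.agg(mean('SALESCLOSEPRICE').alias('mu'),
               stddev('SALESCLOSEPRICE').alias('sigma')).collect()[0]
hi_bound = stats['mu'] + (3 * stats['sigma'])
low_bound = stats['mu'] - (3 * stats['sigma'])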

Dropping NA's or NULLs

DataFrame.dropna()

  • how: ‘any’ or ‘all’. If ‘any’, drop a record if it contains any nulls. If ‘all’, drop a record only if all its values are null.
  • thresh: int, default None. If specified, drop records that have fewer than thresh non-null values. This overwrites the how parameter.
  • subset: optional list of column names to consider.

Dropping NA's or NULLs

# Drop any records with NULL values
df = df.dropna()

# Drop records if both LISTPRICE and SALESCLOSEPRICE are NULL
df = df.dropna(how='all', subset=['LISTPRICE', 'SALESCLOSEPRICE'])
# Drop records with fewer than two non-NULL values
df = df.dropna(thresh=2)
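
Before choosing dropna() arguments, it helps to know how many NULLs each column actually contains (a sketch using standard pyspark.sql.functions, assuming df is already loaded):

from pyspark.sql.functions import col, sum as spark_sum

# Count NULLs per column to decide what to drop
null_counts = df.select([
    spark_sum(col(c).isNull().cast('int')).alias(c) for c in df.columns
])
null_counts.show()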

Dropping Duplicates

What is a duplicate?

  • Two or more records contain all the same information
  • After dropping columns or joining datasets, check for duplicates

dropDuplicates()

  • Can be run across the entire DataFrame or on a list of columns
  • In PySpark there is no guarantee as to which of the duplicate records is kept
# Entire DataFrame
df.dropDuplicates()

# Check only a column list
df.dropDuplicates(['streetaddress'])
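
To see the effect, compare record counts before and after deduplicating (a sketch reusing the column from the example above):

# Measure how many duplicate records were removed
before = df.count()
df = df.dropDuplicates(['streetaddress'])
print('Duplicate records removed:', before - df.count())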

Let's practice!

