Binarizing, Bucketing & Encoding

Feature Engineering with PySpark

John Hogue

Lead Data Scientist

Binarizing

FIREPLACES becomes Has_Fireplace
1 → 1
3 → 1
1 → 1
2 → 1
0 → 0

Binarizing

from pyspark.ml.feature import Binarizer

# Cast the data type to double
df = df.withColumn('FIREPLACES', df['FIREPLACES'].cast('double'))
# Create binarizing transformer
bin = Binarizer(threshold=0.0, inputCol='FIREPLACES', outputCol='FireplaceT')
# Apply the transformer
df = bin.transform(df)
# Inspect the results
df[['FIREPLACES','FireplaceT']].show(3)
+----------+-------------+
|FIREPLACES|   FireplaceT|
+----------+-------------+
|       0.0|          0.0|
|       1.0|          1.0|
|       2.0|          1.0|
+----------+-------------+
only showing top 3 rows
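
The thresholding rule can be checked without a Spark session. The sketch below is a plain-Python illustration, not the Binarizer implementation; the helper name binarize is hypothetical. It assumes the documented rule that values strictly greater than the threshold map to 1.0 and all others map to 0.0.

```python
def binarize(value: float, threshold: float = 0.0) -> float:
    """Return 1.0 if value is strictly greater than threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


# FIREPLACES values from the first Binarizing slide
fireplaces = [1.0, 3.0, 1.0, 2.0, 0.0]
has_fireplace = [binarize(v) for v in fireplaces]
print(has_fireplace)
```

A value equal to the threshold stays at 0.0, which is why threshold=0.0 separates "no fireplace" from "at least one".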

Bucketing

from pyspark.ml.feature import Bucketizer

# Define how to split data
splits = [0, 1, 2, 3, 4, float('Inf')]
# Create bucketing transformer
buck = Bucketizer(splits=splits, inputCol='BATHSTOTAL', outputCol='baths')
# Apply transformer
df = buck.transform(df)
# Inspect results
df[['BATHSTOTAL', 'baths']].show(4)
+----------+-----------------+
|BATHSTOTAL|baths            |
+----------+-----------------+
|         2|              2.0|
|         3|              3.0|
|         1|              1.0|
|         5|              4.0|
+----------+-----------------+
only showing top 4 rows
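
To see why 5 baths lands in bucket 4.0, here is a plain-Python sketch of the bucketing rule. It is an illustration only, and the helper name bucketize is hypothetical. It assumes that n+1 split points define n buckets, that each bucket is [lower, upper) except the last, which also includes its upper bound, and that out-of-range values are treated as errors.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index (as a float) of the bucket that contains value."""
    if value < splits[0] or value > splits[-1]:
        # Assumption: out-of-range values are errors in this sketch
        raise ValueError(f"{value} is outside the splits")
    if value == splits[-1]:
        # The last bucket also includes its upper bound
        return float(len(splits) - 2)
    # Count the split points <= value, then shift to a 0-based bucket index
    return float(bisect_right(splits, value) - 1)


splits = [0, 1, 2, 3, 4, float('inf')]
baths = [bucketize(v, splits) for v in [2, 3, 1, 5]]
print(baths)
```

With these splits, everything from 4 upward shares the final bucket, so rare high bath counts are grouped together.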


One Hot Encoding

CITY becomes         LELM  MAPW  OAKD  STP  WB
LELM - Lake Elmo  →     1     0     0    0   0
MAPW - Maplewood  →     0     1     0    0   0
OAKD - Oakdale    →     0     0     1    0   0
STP - Saint Paul  →     0     0     0    1   0
WB - Woodbury     →     0     0     0    0   1

One Hot Encoding the PySpark Way

from pyspark.ml.feature import OneHotEncoder, StringIndexer
# Create indexer transformer
stringIndexer = StringIndexer(inputCol='CITY', outputCol='City_Index')
# Fit transformer
model = stringIndexer.fit(df)
# Apply transformer
indexed = model.transform(df)
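
How the indices get assigned is easy to miss. The sketch below imitates frequency-based indexing in plain Python; it is not the StringIndexer implementation. It assumes the default ordering, where the most frequent label gets index 0.0. The helper name fit_string_index, the sample city list, and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically (an assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample, not taken from the course dataset
cities = ['STP', 'STP', 'STP', 'WB', 'WB', 'LELM']
mapping = fit_string_index(cities)
city_index = [mapping[c] for c in cities]
print(mapping)
```

The index carries no meaning beyond frequency rank, which is why it is passed on to a one-hot encoder rather than used as a numeric feature.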

One Hot Encoding the PySpark Way

# Create encoder transformer
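encoder = OneHotEncoder(inputCol='City_Index', outputCol='City_Vec')
# Apply the encoder transformer
# (Spark 2.x API; in Spark 3.0+ OneHotEncoder is an estimator, so use encoder.fit(indexed).transform(indexed))
encoded_df = encoder.transform(indexed)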
# Inspect results
encoded_df[['City_Vec']].show(4)
+-------------+
|     City_Vec|
+-------------+
|    (4,[],[])|
|    (4,[],[])|
|(4,[2],[1.0])|
|(4,[2],[1.0])|
+-------------+
only showing top 4 rows
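
The output above uses sparse vector notation: (size, [indices], [values]). The plain-Python sketch below expands that notation and shows why a row can be (4,[],[]). It assumes the encoder's default dropLast=True, under which the last category index is encoded as an all-zero vector, so five cities need only four slots. The helper names sparse_to_dense and one_hot_drop_last are hypothetical.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, [indices], [values]) into a plain list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def one_hot_drop_last(index, num_categories):
    """One-hot encode index, dropping the slot for the last category."""
    size = num_categories - 1
    if index == size:
        # The last category is represented by all zeros
        return sparse_to_dense(size, [], [])
    return sparse_to_dense(size, [index], [1.0])


print(sparse_to_dense(4, [2], [1.0]))   # the row shown as (4,[2],[1.0])
print(sparse_to_dense(4, [], []))       # the row shown as (4,[],[])
print(one_hot_drop_last(4, 5))          # last of five categories
```

Dropping one slot keeps the encoded columns from summing to exactly one on every row, which matters for models sensitive to collinear features.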

Get Transforming!
