Binarizing, Bucketing & Encoding

Feature Engineering with PySpark

John Hogue

Lead Data Scientist

Binarizing

FIREPLACES	becomes	Has_Fireplace
1	⇨	1
3	⇨	1
1	⇨	1
2	⇨	1
0	⇨	0

Binarizing

from pyspark.ml.feature import Binarizer

# Cast the data type to double
df = df.withColumn('FIREPLACES', df['FIREPLACES'].cast('double'))

# Create binarizing transformer
binarizer = Binarizer(threshold=0.0, inputCol='FIREPLACES', outputCol='FireplaceT')
# Apply the transformer (the name `binarizer` avoids shadowing Python's built-in bin())
df = binarizer.transform(df)

# Inspect the results
df[['FIREPLACES','FireplaceT']].show(3)

+----------+-------------+
|FIREPLACES|   FireplaceT|
+----------+-------------+
|       0.0|          0.0|
|       1.0|          1.0|
|       2.0|          1.0|
+----------+-------------+
only showing top 3 rows
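
The thresholding rule behind this transformer can be sketched in plain Python. This is an illustration only, not Spark's implementation, and the helper name `binarize` is hypothetical: values strictly greater than the threshold map to 1.0, everything else maps to 0.0.

```python
def binarize(values, threshold=0.0):
    """Map each value to 1.0 if it is strictly greater than threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]

# FIREPLACES values from the earlier table
fireplaces = [1.0, 3.0, 1.0, 2.0, 0.0]
has_fireplace = binarize(fireplaces, threshold=0.0)
print(has_fireplace)
```

Because the comparison is strict, a value equal to the threshold lands in the 0.0 class, which is why a threshold of 0.0 separates "no fireplace" from "one or more".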

Bucketing

from pyspark.ml.feature import Bucketizer

# Define how to split data
splits = [0, 1, 2, 3, 4, float('Inf')]

# Create bucketing transformer
buck = Bucketizer(splits=splits, inputCol='BATHSTOTAL', outputCol='baths')
# Apply transformer
df = buck.transform(df)

# Inspect results
df[['BATHSTOTAL', 'baths']].show(4)

+----------+-----+
|BATHSTOTAL|baths|
+----------+-----+
|         2|  2.0|
|         3|  3.0|
|         1|  1.0|
|         5|  4.0|
+----------+-----+
only showing top 4 rows
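
To see why 5 baths ends up in bucket 4.0, the split rule can be sketched in plain Python. This is a rough illustration, not Spark's code, and the helper name `bucketize` is hypothetical. It assumes each bucket covers the half-open range [splits[i], splits[i+1]), with the last bucket also including its upper bound.

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) that contains value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range covered by splits")
    if value == splits[-1]:
        # The last bucket also includes its upper bound
        return float(len(splits) - 2)
    # bisect_right counts the split points <= value; subtract 1 for the bucket index
    return float(bisect_right(splits, value) - 1)

splits = [0, 1, 2, 3, 4, float('inf')]
baths = [bucketize(v, splits) for v in [2, 3, 1, 5]]
print(baths)
```

Six split points define five buckets, so every value of 4 or more falls into the final bucket, index 4.0.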

One Hot Encoding

CITY	becomes	LELM	MAPW	OAKD	STP	WB
LELM - Lake Elmo	⇨	1	0	0	0	0
MAPW - Maplewood	⇨	0	1	0	0	0
OAKD - Oakdale	⇨	0	0	1	0	0
STP - Saint Paul	⇨	0	0	0	1	0
WB - Woodbury	⇨	0	0	0	0	1

One Hot Encoding the PySpark Way

from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Create indexer transformer
stringIndexer = StringIndexer(inputCol='CITY', outputCol='City_Index')

# Fit transformer
model = stringIndexer.fit(df)
# Apply transformer
indexed = model.transform(df)

One Hot Encoding the PySpark Way

# Create encoder transformer
encoder = OneHotEncoder(inputCol='City_Index', outputCol='City_Vec')

# Apply the encoder transformer
# (In Spark 3.0+, OneHotEncoder is an Estimator: use encoder.fit(indexed).transform(indexed))
encoded_df = encoder.transform(indexed)

# Inspect results
encoded_df[['City_Vec']].show(4)

+-------------+
|     City_Vec|
+-------------+
|    (4,[],[])|
|    (4,[],[])|
|(4,[2],[1.0])|
|(4,[2],[1.0])|
+-------------+
only showing top 4 rows
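
The sparse notation in this output reads as (size, indices, values). A plain-Python sketch of the two steps may help explain why some rows show `(4,[],[])`. It is an illustration under assumptions, not Spark's implementation: the helper names `string_index` and `one_hot_sparse` are hypothetical, the city counts are made up, and ties in frequency are broken alphabetically here. It assumes indices are assigned by descending frequency and that the last category is dropped, so five cities yield vectors of size 4.

```python
from collections import Counter

def string_index(labels):
    """Assign indices by descending frequency (ties broken alphabetically, an assumption)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot_sparse(index, num_categories, drop_last=True):
    """Return (size, indices, values), mirroring how a sparse vector is printed."""
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return (size, [index], [1.0])
    # The dropped (last) category is encoded as an all-zero vector
    return (size, [], [])

# Made-up counts, for illustration only
cities = ['WB'] * 5 + ['STP'] * 4 + ['OAKD'] * 3 + ['MAPW'] * 2 + ['LELM']
index = string_index(cities)
vectors = {city: one_hot_sparse(i, len(index)) for city, i in index.items()}
print(vectors['OAKD'])
print(vectors['LELM'])
```

Dropping the last category keeps the encoded columns from being perfectly collinear: the all-zero vector already identifies the remaining city.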

Get Transforming!
