Feature Engineering with PySpark
John Hogue
Lead Data Scientist
| FIREPLACES | becomes | Has_Fireplace |
|---|---|---|
| 1 | → | 1 |
| 3 | → | 1 |
| 1 | → | 1 |
| 2 | → | 1 |
| 0 | → | 0 |
from pyspark.ml.feature import Binarizer
# Cast the data type to double
df = df.withColumn('FIREPLACES', df['FIREPLACES'].cast('double'))
# Create binarizing transformer
bin = Binarizer(threshold=0.0, inputCol='FIREPLACES', outputCol='FireplaceT')
# Apply the transformer
df = bin.transform(df)
# Inspect the results
df[['FIREPLACES','FireplaceT']].show(3)
+----------+-------------+
|FIREPLACES|   FireplaceT|
+----------+-------------+
|       0.0|          0.0|
|       1.0|          1.0|
|       2.0|          1.0|
+----------+-------------+
only showing top 3 rows
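The thresholding rule behind the output above can be sketched in plain Python, without Spark. The helper name `binarize` and the sample list are hypothetical. They only mirror the documented Binarizer behaviour: values strictly greater than the threshold become 1.0, and everything else becomes 0.0.

```python
def binarize(value, threshold=0.0):
    """Return 1.0 if value is strictly greater than threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


# Hypothetical sample of FIREPLACES counts, already cast to float
fireplaces = [0.0, 1.0, 2.0]
has_fireplace = [binarize(v) for v in fireplaces]
print(has_fireplace)  # [0.0, 1.0, 1.0]
```

With `threshold=0.0`, a home with no fireplaces stays at 0.0 and any positive count collapses to 1.0, which is the "has a fireplace" flag.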
from pyspark.ml.feature import Bucketizer
# Define how to split data
splits = [0, 1, 2, 3, 4, float('Inf')]
# Create bucketing transformer
buck = Bucketizer(splits=splits, inputCol='BATHSTOTAL', outputCol='baths')
# Apply transformer
df = buck.transform(df)
# Inspect results
df[['BATHSTOTAL', 'baths']].show(4)
+----------+-----------------+
|BATHSTOTAL|            baths|
+----------+-----------------+
|         2|              2.0|
|         3|              3.0|
|         1|              1.0|
|         5|              4.0|
+----------+-----------------+
only showing top 4 rows
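To see why a home with 5 baths lands in bucket 4.0, the bucket lookup can be sketched in plain Python. This is a minimal sketch, not Spark's implementation: the helper name `bucketize` is hypothetical, and it assumes the documented rule that bucket i covers [splits[i], splits[i+1]), with the last bucket also including its upper bound.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index (as a float) of the bucket that contains value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range covered by splits")
    # The last bucket is closed on the right, so the top split maps into it
    if value == splits[-1]:
        return float(len(splits) - 2)
    # Each bucket is [lower, upper): count the splits <= value, minus one
    return float(bisect_right(splits, value) - 1)


splits = [0, 1, 2, 3, 4, float('inf')]
baths = [bucketize(v, splits) for v in [2, 3, 1, 5]]
print(baths)  # [2.0, 3.0, 1.0, 4.0]
```

Six split points define five buckets, so every value of 4 or more shares the final bucket.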
| CITY | becomes | LELM | MAPW | OAKD | STP | WB |
|---|---|---|---|---|---|---|
| LELM - Lake Elmo | → | 1 | 0 | 0 | 0 | 0 |
| MAPW - Maplewood | → | 0 | 1 | 0 | 0 | 0 |
| OAKD - Oakdale | → | 0 | 0 | 1 | 0 | 0 |
| STP - Saint Paul | → | 0 | 0 | 0 | 1 | 0 |
| WB - Woodbury | → | 0 | 0 | 0 | 0 | 1 |
from pyspark.ml.feature import OneHotEncoder, StringIndexer
# Create indexer transformer
stringIndexer = StringIndexer(inputCol='CITY', outputCol='City_Index')
# Fit transformer
model = stringIndexer.fit(df)
# Apply transformer
indexed = model.transform(df)
# Create encoder transformer
encoder = OneHotEncoder(inputCol='City_Index', outputCol='City_Vec')
# Apply the encoder transformer
# (on Spark 3.0+, OneHotEncoder is an estimator: use encoder.fit(indexed).transform(indexed))
encoded_df = encoder.transform(indexed)
# Inspect results
encoded_df[['City_Vec']].show(4)
+-------------+
|     City_Vec|
+-------------+
|    (4,[],[])|
|    (4,[],[])|
|(4,[2],[1.0])|
|(4,[2],[1.0])|
+-------------+
only showing top 4 rows
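The `City_Vec` values are printed in sparse-vector notation, `(size,[indices],[values])`. A small plain-Python sketch can expand them into dense lists for reading. The helper name `to_dense` is hypothetical and is not part of PySpark.

```python
def to_dense(size, indices, values):
    """Expand (size, [indices], [values]) into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


print(to_dense(4, [2], [1.0]))  # [0.0, 0.0, 1.0, 0.0]
print(to_dense(4, [], []))      # [0.0, 0.0, 0.0, 0.0]
```

The vectors have four slots for five cities because, assuming the encoder's default of dropping the last category, the city with the highest index is represented by the all-zero vector `(4,[],[])`.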