Feature Engineering with PySpark
John Hogue
Lead Data Scientist
FIREPLACES | becomes | Has_Fireplace |
---|---|---|
1 | → | 1 |
3 | → | 1 |
1 | → | 1 |
2 | → | 1 |
0 | → | 0 |
from pyspark.ml.feature import Binarizer
# Cast the data type to double
df = df.withColumn('FIREPLACES', df['FIREPLACES'].cast('double'))
# Create binarizing transformer
bin = Binarizer(threshold=0.0, inputCol='FIREPLACES', outputCol='FireplaceT')
# Apply the transformer
df = bin.transform(df)
# Inspect the results
df[['FIREPLACES', 'FireplaceT']].show(3)
+----------+----------+
|FIREPLACES|FireplaceT|
+----------+----------+
|       0.0|       0.0|
|       1.0|       1.0|
|       2.0|       1.0|
+----------+----------+
only showing top 3 rows
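To make the threshold rule concrete, here is a minimal pure-Python sketch of what the binarizing step computes: values strictly greater than the threshold become 1.0, everything else becomes 0.0. The helper name `binarize` is hypothetical and only mirrors the rule; it is not part of PySpark.

```python
def binarize(values, threshold=0.0):
    """Return 1.0 where value > threshold (strictly greater), else 0.0.

    Hypothetical helper that mirrors the Binarizer rule on a plain list.
    """
    return [1.0 if v > threshold else 0.0 for v in values]


# FIREPLACES values from the example table above
fireplaces = [1.0, 3.0, 1.0, 2.0, 0.0]
has_fireplace = binarize(fireplaces, threshold=0.0)
print(has_fireplace)
```

Because the comparison is strict, a value equal to the threshold maps to 0.0, which is why `threshold=0.0` separates homes with no fireplace from homes with at least one.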
from pyspark.ml.feature import Bucketizer
# Define how to split data
splits = [0, 1, 2, 3, 4, float('Inf')]
# Create bucketing transformer
buck = Bucketizer(splits=splits, inputCol='BATHSTOTAL', outputCol='baths')
# Apply transformer
df = buck.transform(df)
# Inspect results
df[['BATHSTOTAL', 'baths']].show(4)
+----------+-----+
|BATHSTOTAL|baths|
+----------+-----+
|         2|  2.0|
|         3|  3.0|
|         1|  1.0|
|         5|  4.0|
+----------+-----+
only showing top 4 rows
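The bucket assignment can be sketched in plain Python, assuming the usual Bucketizer convention: each bucket is the half-open range [lower, upper), except the last bucket, which also includes its upper bound. The helper name `bucketize` is hypothetical, and rejecting out-of-range values with a `ValueError` is an assumption standing in for Spark's own error handling.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the bucket index of value as a float.

    Hypothetical helper: buckets are [splits[i], splits[i + 1]),
    and the last bucket also includes its upper bound.
    """
    if value < splits[0] or value > splits[-1]:
        # Assumption: out-of-range values are rejected
        raise ValueError(f"{value} is outside the splits {splits}")
    if value == splits[-1]:
        return float(len(splits) - 2)
    return float(bisect_right(splits, value) - 1)


splits = [0, 1, 2, 3, 4, float('inf')]
# BATHSTOTAL values from the output above
baths = [bucketize(v, splits) for v in [2, 3, 1, 5]]
print(baths)
```

With these splits, homes with 4 or more baths all land in the same top bucket, which is why 5 maps to 4.0.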
CITY | becomes | LELM | MAPW | OAKD | STP | WB |
---|---|---|---|---|---|---|
LELM - Lake Elmo | → | 1 | 0 | 0 | 0 | 0 |
MAPW - Maplewood | → | 0 | 1 | 0 | 0 | 0 |
OAKD - Oakdale | → | 0 | 0 | 1 | 0 | 0 |
STP - Saint Paul | → | 0 | 0 | 0 | 1 | 0 |
WB - Woodbury | → | 0 | 0 | 0 | 0 | 1 |
from pyspark.ml.feature import OneHotEncoder, StringIndexer
# Create indexer transformer
stringIndexer = StringIndexer(inputCol='CITY', outputCol='City_Index')
# Fit transformer
model = stringIndexer.fit(df)
# Apply transformer
indexed = model.transform(df)
# Create encoder transformer
encoder = OneHotEncoder(inputCol='City_Index', outputCol='City_Vec')
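# Apply the encoder transformer (Spark 2.x API; from Spark 3.0 on, OneHotEncoder is an estimator, so call encoder.fit(indexed) first)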
encoded_df = encoder.transform(indexed)
# Inspect results
encoded_df[['City_Vec']].show(4)
+-------------+
| City_Vec|
+-------------+
| (4,[],[])|
| (4,[],[])|
|(4,[2],[1.0])|
|(4,[2],[1.0])|
+-------------+
only showing top 4 rows
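The sparse vectors above read as (size, [active indices], [values]). A rough pure-Python sketch of the two steps, on toy data rather than the course dataset, may help explain why the size is 4 for five cities and why some rows are empty. The helpers `string_index` and `one_hot_sparse` are hypothetical, and the sketch assumes labels are indexed by descending frequency with ties broken alphabetically, and that the last category is dropped and encoded as an all-zero vector.

```python
from collections import Counter


def string_index(labels):
    """Map each label to an index, most frequent label first.

    Assumption: ties in frequency are broken alphabetically.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot_sparse(index, num_labels, drop_last=True):
    """Return (size, [active indices], [values]) for one category index.

    Assumption: with drop_last=True the last index becomes all zeros,
    so the vector has num_labels - 1 slots.
    """
    size = num_labels - 1 if drop_last else num_labels
    if index < size:
        return (size, [index], [1.0])
    return (size, [], [])


# Toy data: one row per city, so the tie-break decides the order
cities = ['LELM', 'MAPW', 'OAKD', 'STP', 'WB']
mapping = string_index(cities)
vectors = {c: one_hot_sparse(mapping[c], len(mapping)) for c in cities}
for city, vec in vectors.items():
    print(city, vec)
```

Dropping the last category loses no information: a row with no active index can only belong to the one category that was left out.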