Binarizing, Bucketing & Encoding

Feature Engineering with PySpark

John Hogue

Lead Data Scientist

Binarizing

FIREPLACES becomes Has_Fireplace

FIREPLACES  Has_Fireplace
1           1
3           1
1           1
2           1
0           0

Binarizing

from pyspark.ml.feature import Binarizer

# Cast the column to double, the input type Binarizer expects
df = df.withColumn('FIREPLACES', df['FIREPLACES'].cast('double'))
# Create the binarizing transformer
bin = Binarizer(threshold=0.0, inputCol='FIREPLACES', outputCol='FireplaceT')
# Apply the transformer
df = bin.transform(df)
# Inspect the results
df[['FIREPLACES','FireplaceT']].show(3)
+----------+-------------+
|FIREPLACES|   FireplaceT|
+----------+-------------+
|       0.0|          0.0|
|       1.0|          1.0|
|       2.0|          1.0|
+----------+-------------+
only showing top 3 rows
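
The output above follows from the threshold rule: values strictly greater than the threshold map to 1.0, everything else to 0.0. A minimal plain-Python sketch of that rule, where the helper name `binarize` is made up for illustration and is not part of PySpark:

```python
def binarize(values, threshold=0.0):
    """Map each value to 1.0 if it is strictly greater than threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]

# The three FIREPLACES values shown in the output above
fireplaces = [0.0, 1.0, 2.0]
has_fireplace = binarize(fireplaces, threshold=0.0)
print(has_fireplace)
```

Because the comparison is strict, a home with exactly 0 fireplaces stays at 0.0 when the threshold is 0.0.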

Bucketing

from pyspark.ml.feature import Bucketizer

# Define how to split the data: the last bucket collects 4 or more baths
splits = [0, 1, 2, 3, 4, float('Inf')]
# Create the bucketing transformer
buck = Bucketizer(splits=splits, inputCol='BATHSTOTAL', outputCol='baths')
# Apply the transformer
df = buck.transform(df)
# Inspect the results
df[['BATHSTOTAL', 'baths']].show(4)
+----------+-----------------+
|BATHSTOTAL|baths            |
+----------+-----------------+
|         2|              2.0|
|         3|              3.0|
|         1|              1.0|
|         5|              4.0|
+----------+-----------------+
only showing top 4 rows
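
Each pair of neighbouring split points defines one bucket that includes its lower bound and excludes its upper bound, except the last bucket, which also includes its upper bound. That is why 5 baths lands in bucket 4.0 above. A rough plain-Python sketch of that lookup, where `bucketize` is a hypothetical helper and not a PySpark function:

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) containing value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value lies outside the split range")
    if value == splits[-1]:
        # The last bucket also includes its upper bound
        return float(len(splits) - 2)
    return float(bisect_right(splits, value) - 1)

splits = [0, 1, 2, 3, 4, float('inf')]
# The four BATHSTOTAL values shown in the output above
baths = [bucketize(v, splits) for v in [2, 3, 1, 5]]
print(baths)
```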


One-Hot Encoding

CITY becomes LELM, MAPW, OAKD, STP, WB

CITY              LELM  MAPW  OAKD  STP  WB
LELM - Lake Elmo  1     0     0     0    0
MAPW - Maplewood  0     1     0     0    0
OAKD - Oakdale    0     0     1     0    0
STP - Saint Paul  0     0     0     1    0
WB - Woodbury     0     0     0     0    1

One-Hot Encoding the PySpark Way

from pyspark.ml.feature import OneHotEncoder, StringIndexer
# Create the indexer, which maps each city string to a numeric index
stringIndexer = StringIndexer(inputCol='CITY', outputCol='City_Index')
# Fit the indexer so it learns the distinct city values
model = stringIndexer.fit(df)
# Apply the fitted indexer
indexed = model.transform(df)
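
By default, StringIndexer assigns indices by descending label frequency, so the most common city gets index 0.0. A small plain-Python sketch of that ordering follows. The city counts are invented for illustration, `fit_string_index` is a hypothetical helper, and the alphabetical tie-break is an assumption of this sketch:

```python
from collections import Counter

def fit_string_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically (assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample, not taken from the course dataset
cities = ['STP', 'WB', 'STP', 'LELM', 'STP', 'WB']
city_index = fit_string_index(cities)
print(city_index)
```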

One-Hot Encoding the PySpark Way

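# Create the encoder transformer
encoder = OneHotEncoder(inputCol='City_Index', outputCol='City_Vec')
# Apply the encoder transformer
# (in Spark 3.0+, OneHotEncoder is an estimator: use encoder.fit(indexed).transform(indexed))
encoded_df = encoder.transform(indexed)
# Inspect the results
encoded_df[['City_Vec']].show(4)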
+-------------+
|     City_Vec|
+-------------+
|    (4,[],[])|
|    (4,[],[])|
|(4,[2],[1.0])|
|(4,[2],[1.0])|
+-------------+
only showing top 4 rows
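
Each row above is a sparse vector written as (size, [indices], [values]). With five cities and the encoder's default of dropping the last category, the vector has size 4, and the dropped category appears as the all-zero vector (4,[],[]). A plain-Python sketch of the dense equivalent, where `one_hot` is a hypothetical helper rather than a PySpark call:

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list; the last category becomes all zeros if dropped."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# (4,[2],[1.0]) corresponds to index 2 out of 5 categories
dense_index_2 = one_hot(2, 5)
# (4,[],[]) corresponds to the dropped last category, index 4
dense_index_4 = one_hot(4, 5)
print(dense_index_2, dense_index_4)
```

Dropping one category keeps the encoded columns linearly independent, since the last category is implied when all the others are zero.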

Time to Transform!
