Machine Learning with PySpark
Andrew Collier
Data Scientist, Fathom Data
# Counts for 'type' category
+-------+-----+
|   type|count|
+-------+-----+
|Midsize|   22|
|  Small|   21|
|Compact|   16|
| Sporty|   14|
|  Large|   11|
|    Van|    9|
+-------+-----+
# Numerical indices for 'type' category
+-------+--------+
|   type|type_idx|
+-------+--------+
|Midsize|     0.0|
|  Small|     1.0|
|Compact|     2.0|
| Sporty|     3.0|
|  Large|     4.0|
|    Van|     5.0|
+-------+--------+
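The indices above follow frequency: the most common level gets 0.0 and the rarest gets 5.0. Spark's `StringIndexer` orders labels by descending frequency by default. The plain-Python sketch below only illustrates that ordering and is not Spark code. The `index_by_frequency` helper and the rebuilt `types` list are assumptions for illustration, with counts copied from the table above.

```python
from collections import Counter

# Counts copied from the 'type' table; listed alphabetically on purpose so the
# result depends on frequency, not on input order.
counts = {"Compact": 16, "Large": 11, "Midsize": 22, "Small": 21, "Sporty": 14, "Van": 9}

# Rebuild a flat list of labels, one entry per (hypothetical) car.
types = [label for label, n in counts.items() for _ in range(n)]


def index_by_frequency(values):
    """Map each distinct label to a float index, most frequent label first."""
    ordered = [label for label, _ in Counter(values).most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}


type_index = index_by_frequency(types)
print(type_index)
```

The numbers are arbitrary labels: "Small" (1.0) is not "less than" "Compact" (2.0) in any meaningful sense, which is why the indices are turned into dummy variables next.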
+-------+       +-------+-------+-------+-------+-------+-------+
|   type|       |Midsize|  Small|Compact| Sporty|  Large|    Van|
+-------+       +-------+-------+-------+-------+-------+-------+
|Midsize|       |   X   |       |       |       |       |       |
|  Small|       |       |   X   |       |       |       |       |
|Compact| ===>  |       |       |   X   |       |       |       |
| Sporty|       |       |       |       |   X   |       |       |
|  Large|       |       |       |       |       |   X   |       |
|    Van|       |       |       |       |       |       |   X   |
+-------+       +-------+-------+-------+-------+-------+-------+
Each categorical level becomes a column.
+-------+       +-------+-------+-------+-------+-------+-------+
|   type|       |Midsize|  Small|Compact| Sporty|  Large|    Van|
+-------+       +-------+-------+-------+-------+-------+-------+
|Midsize|       |   1   |   0   |   0   |   0   |   0   |   0   |
|  Small|       |   0   |   1   |   0   |   0   |   0   |   0   |
|Compact| ===>  |   0   |   0   |   1   |   0   |   0   |   0   |
| Sporty|       |   0   |   0   |   0   |   1   |   0   |   0   |
|  Large|       |   0   |   0   |   0   |   0   |   1   |   0   |
|    Van|       |   0   |   0   |   0   |   0   |   0   |   1   |
+-------+       +-------+-------+-------+-------+-------+-------+
Binary values indicate the presence (1) or absence (0) of the corresponding level.
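As a rough illustration of the table above, one row of dummy variables can be built in plain Python. This is a sketch of the idea only, not how Spark implements it; the `one_hot` helper and the `LEVELS` list are names invented for this example.

```python
# Levels in the same order as the columns of the table above.
LEVELS = ["Midsize", "Small", "Compact", "Sporty", "Large", "Van"]


def one_hot(label, levels=LEVELS):
    """Return one 0/1 entry per level; exactly one entry is 1 for a known label."""
    if label not in levels:
        raise ValueError(f"unknown level: {label!r}")
    return [1 if label == level else 0 for level in levels]


compact_row = one_hot("Compact")
print(compact_row)  # [0, 0, 1, 0, 0, 0]
```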
+-------+       +-------+-------+-------+-------+-------+-------+       +------+-----+
|   type|       |Midsize|  Small|Compact| Sporty|  Large|    Van|       |Column|Value|
+-------+       +-------+-------+-------+-------+-------+-------+       +------+-----+
|Midsize|       |   1   |   0   |   0   |   0   |   0   |   0   |       |     0|    1|
|  Small|       |   0   |   1   |   0   |   0   |   0   |   0   |       |     1|    1|
|Compact| ===>  |   0   |   0   |   1   |   0   |   0   |   0   | ===>  |     2|    1|
| Sporty|       |   0   |   0   |   0   |   1   |   0   |   0   |       |     3|    1|
|  Large|       |   0   |   0   |   0   |   0   |   1   |   0   |       |     4|    1|
|    Van|       |   0   |   0   |   0   |   0   |   0   |   1   |       |     5|    1|
+-------+       +-------+-------+-------+-------+-------+-------+       +------+-----+
Sparse representation: store only the column index and the value of each non-zero entry.
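The step from the wide 0/1 row to a (column, value) pair can be sketched as follows. This is plain Python for illustration, and `to_sparse` is a hypothetical helper, not part of PySpark.

```python
def to_sparse(row):
    """Keep only the non-zero entries of a row as (column, value) pairs."""
    return [(column, value) for column, value in enumerate(row) if value != 0]


# The 'Compact' row from the table above has its single 1 in column 2.
compact_sparse = to_sparse([0, 0, 1, 0, 0, 0])
print(compact_sparse)  # [(2, 1)]
```

Because a one-hot row has a single non-zero entry, the sparse form stays the same size no matter how many levels the category has.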
+-------+       +-------+-------+-------+-------+-------+       +------+-----+
|   type|       |Midsize|  Small|Compact| Sporty|  Large|       |Column|Value|
+-------+       +-------+-------+-------+-------+-------+       +------+-----+
|Midsize|       |   1   |   0   |   0   |   0   |   0   |       |     0|    1|
|  Small|       |   0   |   1   |   0   |   0   |   0   |       |     1|    1|
|Compact| ===>  |   0   |   0   |   1   |   0   |   0   | ===>  |     2|    1|
| Sporty|       |   0   |   0   |   0   |   1   |   0   |       |     3|    1|
|  Large|       |   0   |   0   |   0   |   0   |   1   |       |     4|    1|
|    Van|       |   0   |   0   |   0   |   0   |   0   |       |      |     |
+-------+       +-------+-------+-------+-------+-------+       +------+-----+
The levels are mutually exclusive, so one column can be dropped: a row with no 1 must belong to the dropped level.
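A minimal sketch of why dropping one column loses no information: a row of all zeros can only be the dropped level. The helpers `encode_drop_last` and `decode_drop_last` are invented for this illustration and are not Spark functions.

```python
LEVELS = ["Midsize", "Small", "Compact", "Sporty", "Large", "Van"]


def encode_drop_last(label, levels=LEVELS):
    """One-hot encode against every level except the last one."""
    if label not in levels:
        raise ValueError(f"unknown level: {label!r}")
    return [1 if label == level else 0 for level in levels[:-1]]


def decode_drop_last(row, levels=LEVELS):
    """Recover the label; an all-zero row means the dropped (last) level."""
    return levels[row.index(1)] if 1 in row else levels[-1]


van_row = encode_drop_last("Van")
print(van_row)                    # [0, 0, 0, 0, 0]
print(decode_drop_last(van_row))  # Van
```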
from pyspark.ml.feature import OneHotEncoder
onehot = OneHotEncoder(inputCols=['type_idx'], outputCols=['type_dummy'])
Fit the encoder to the data.
onehot = onehot.fit(cars)
# How many category levels?
onehot.categorySizes
[6]
cars = onehot.transform(cars)
cars.select('type', 'type_idx', 'type_dummy').distinct().sort('type_idx').show()
+-------+--------+-------------+
|   type|type_idx|   type_dummy|
+-------+--------+-------------+
|Midsize|     0.0|(5,[0],[1.0])|
|  Small|     1.0|(5,[1],[1.0])|
|Compact|     2.0|(5,[2],[1.0])|
| Sporty|     3.0|(5,[3],[1.0])|
|  Large|     4.0|(5,[4],[1.0])|
|    Van|     5.0|    (5,[],[])|
+-------+--------+-------------+
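Each `type_dummy` value is printed as (size, [indices], [values]). The size is 5 rather than 6 because `OneHotEncoder` drops the last level by default (`dropLast=True`), so 'Van' appears as a vector with no non-zero entries. The plain-Python sketch below expands that notation into a full row; `to_dense` is a hypothetical helper used only to show how the three parts fit together.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a full list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


compact_dense = to_dense(5, [2], [1.0])  # 'Compact' row: (5,[2],[1.0])
van_dense = to_dense(5, [], [])          # 'Van' row: (5,[],[])
print(compact_dense)  # [0.0, 0.0, 1.0, 0.0, 0.0]
print(van_dense)      # [0.0, 0.0, 0.0, 0.0, 0.0]
```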
from pyspark.ml.linalg import DenseVector, SparseVector
Store this vector: [1, 0, 0, 0, 0, 7, 0, 0].
DenseVector([1, 0, 0, 0, 0, 7, 0, 0])
DenseVector([1.0, 0.0, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0])
SparseVector(8, [0, 5], [1, 7])
SparseVector(8, {0: 1.0, 5: 7.0})