One-Hot Encoding

Machine Learning with PySpark

Andrew Collier

Data Scientist, Fathom Data

The problem with indexed values

# Counts for 'type' category

+-------+-----+
|   type|count|
+-------+-----+
|Midsize|   22|
|  Small|   21|
|Compact|   16|
| Sporty|   14|
|  Large|   11|
|    Van|    9|
+-------+-----+
# Numerical indices for 'type' category

+-------+--------+
|   type|type_idx|
+-------+--------+
|Midsize|     0.0|
|  Small|     1.0|
|Compact|     2.0|
| Sporty|     3.0|
|  Large|     4.0|
|    Van|     5.0|
+-------+--------+
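
The trouble with these indices is that a model reads them as ordinary numbers, so it sees an order and a distance between car types that do not exist. The sketch below is plain Python rather than the PySpark API. It assumes the indices are assigned by descending frequency, which is consistent with the two tables above; `counts`, `type_idx` and `midpoint` are illustrative names.

```python
# Category counts copied from the first table above.
counts = {"Midsize": 22, "Small": 21, "Compact": 16,
          "Sporty": 14, "Large": 11, "Van": 9}

# Assumed rule: the most frequent level gets index 0.0, the next 1.0, ...
# This reproduces the type_idx column of the second table.
ordered = sorted(counts, key=counts.get, reverse=True)
type_idx = {level: float(i) for i, level in enumerate(ordered)}

# Arithmetic on the indices is legal but meaningless: the "average" of
# Midsize and Compact lands exactly on the index of Small.
midpoint = (type_idx["Midsize"] + type_idx["Compact"]) / 2

print(type_idx)
print(midpoint == type_idx["Small"])
```

A regression model fed `type_idx` directly would make the same kind of inference, which is the motivation for the dummy variables that follow.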

Dummy variables

+-------+      +-------+-------+-------+-------+-------+-------+
|   type|      |Midsize|  Small|Compact| Sporty|  Large|    Van|
+-------+      +-------+-------+-------+-------+-------+-------+
|Midsize|      |   X   |       |       |       |       |       |
|  Small|      |       |   X   |       |       |       |       |
|Compact| ===> |       |       |   X   |       |       |       |
| Sporty|      |       |       |       |   X   |       |       |
|  Large|      |       |       |       |       |   X   |       |
|    Van|      |       |       |       |       |       |   X   |
+-------+      +-------+-------+-------+-------+-------+-------+

Each categorical level becomes a column.

Dummy variables: binary encoding

+-------+      +-------+-------+-------+-------+-------+-------+
|   type|      |Midsize|  Small|Compact| Sporty|  Large|    Van|
+-------+      +-------+-------+-------+-------+-------+-------+
|Midsize|      |   1   |   0   |   0   |   0   |   0   |   0   |
|  Small|      |   0   |   1   |   0   |   0   |   0   |   0   |
|Compact| ===> |   0   |   0   |   1   |   0   |   0   |   0   |
| Sporty|      |   0   |   0   |   0   |   1   |   0   |   0   |
|  Large|      |   0   |   0   |   0   |   0   |   1   |   0   |
|    Van|      |   0   |   0   |   0   |   0   |   0   |   1   |
+-------+      +-------+-------+-------+-------+-------+-------+

Binary values indicate the presence (1) or absence (0) of the corresponding level.
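
As a rough illustration of the table above, the binary encoding can be sketched in a few lines of plain Python. This is not how PySpark implements it; `levels` and `one_hot` are hypothetical names chosen for the example.

```python
# The six levels of 'type', in the column order used by the table above.
levels = ["Midsize", "Small", "Compact", "Sporty", "Large", "Van"]

def one_hot(value, levels):
    """Return a list with 1 in the position of `value` and 0 elsewhere."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if level == value else 0 for level in levels]

# Print one encoded row per car type.
for car_type in levels:
    print(f"{car_type:>7} {one_hot(car_type, levels)}")
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.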

Dummy variables: sparse representation

+-------+      +-------+-------+-------+-------+-------+-------+      +------+-----+
|   type|      |Midsize|  Small|Compact| Sporty|  Large|    Van|      |Column|Value|
+-------+      +-------+-------+-------+-------+-------+-------+      +------+-----+
|Midsize|      |   1   |   0   |   0   |   0   |   0   |   0   |      |     0|    1|
|  Small|      |   0   |   1   |   0   |   0   |   0   |   0   |      |     1|    1|
|Compact| ===> |   0   |   0   |   1   |   0   |   0   |   0   | ===> |     2|    1|
| Sporty|      |   0   |   0   |   0   |   1   |   0   |   0   |      |     3|    1|
|  Large|      |   0   |   0   |   0   |   0   |   1   |   0   |      |     4|    1|
|    Van|      |   0   |   0   |   0   |   0   |   0   |   1   |      |     5|    1|
+-------+      +-------+-------+-------+-------+-------+-------+      +------+-----+

Sparse representation: store only the column index and the value.
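
The step from the dense table to the (Column, Value) table might be sketched as follows. This is a plain-Python illustration under the assumption that only non-zero entries are worth storing; `to_sparse` is a hypothetical helper, not a PySpark function.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as (column, value) pairs."""
    return [(col, val) for col, val in enumerate(dense) if val != 0]

# Dense one-hot row for 'Compact', taken from the table above.
compact_row = [0, 0, 1, 0, 0, 0]
print(to_sparse(compact_row))
```

With one non-zero entry per row, the sparse form stores a single pair regardless of how many levels the category has.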

Dummy variables: redundant column

+-------+      +-------+-------+-------+-------+-------+      +------+-----+
|   type|      |Midsize|  Small|Compact| Sporty|  Large|      |Column|Value|
+-------+      +-------+-------+-------+-------+-------+      +------+-----+
|Midsize|      |   1   |   0   |   0   |   0   |   0   |      |     0|    1|
|  Small|      |   0   |   1   |   0   |   0   |   0   |      |     1|    1|
|Compact| ===> |   0   |   0   |   1   |   0   |   0   | ===> |     2|    1|
| Sporty|      |   0   |   0   |   0   |   1   |   0   |      |     3|    1|
|  Large|      |   0   |   0   |   0   |   0   |   1   |      |     4|    1|
|    Van|      |   0   |   0   |   0   |   0   |   0   |      |      |     |
+-------+      +-------+-------+-------+-------+-------+      +------+-----+

The levels are mutually exclusive, so one column is redundant and can be dropped.
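
To see why no information is lost, here is a small plain-Python sketch that drops the last level and still recovers every car type. The names `kept`, `encode` and `decode` are made up for this example, and dropping the last level (rather than any other) is an assumption that matches the table above.

```python
levels = ["Midsize", "Small", "Compact", "Sporty", "Large", "Van"]
kept = levels[:-1]  # drop the column for the last level, 'Van'

def encode(value):
    """Encode a level using only the five kept columns."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if level == value else 0 for level in kept]

def decode(row):
    """Recover the level; an all-zero row can only be the dropped level."""
    return kept[row.index(1)] if 1 in row else levels[-1]

print(encode("Van"))           # no column is set for the dropped level
print(decode(encode("Van")))   # yet the level is still recoverable
```

Because exactly one level applies to each row, "none of the other five" already identifies the sixth.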

One-hot encoding

from pyspark.ml.feature import OneHotEncoder

onehot = OneHotEncoder(inputCols=['type_idx'], outputCols=['type_dummy'])

Fit the encoder to the data.

onehot = onehot.fit(cars)
# How many category levels?
onehot.categorySizes
[6]

One-hot encoding

cars = onehot.transform(cars)
cars.select('type', 'type_idx', 'type_dummy').distinct().sort('type_idx').show()
+-------+--------+-------------+
|   type|type_idx|   type_dummy|
+-------+--------+-------------+
|Midsize|     0.0|(5,[0],[1.0])|
|  Small|     1.0|(5,[1],[1.0])|
|Compact|     2.0|(5,[2],[1.0])|
| Sporty|     3.0|(5,[3],[1.0])|
|  Large|     4.0|(5,[4],[1.0])|
|    Van|     5.0|    (5,[],[])|
+-------+--------+-------------+
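
Each entry in `type_dummy` reads as (size, [indices], [values]). To make that notation concrete, the hypothetical helper below expands such a triple into an ordinary list. It is plain Python for illustration only and does not use Spark's vector classes.

```python
def expand(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Rows taken from the output above.
print("Midsize", expand(5, [0], [1.0]))
print("Large  ", expand(5, [4], [1.0]))
print("Van    ", expand(5, [], []))
```

Note that the vectors have length 5 even though there are 6 levels: the last level, Van, is represented by the all-zero vector.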

Dense versus sparse

from pyspark.mllib.linalg import DenseVector, SparseVector

Store this vector: [1, 0, 0, 0, 0, 7, 0, 0].

DenseVector([1, 0, 0, 0, 0, 7, 0, 0])
DenseVector([1.0, 0.0, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0])
SparseVector(8, [0, 5], [1, 7])
SparseVector(8, {0: 1.0, 5: 7.0})

One-Hot Encode categoricals
