Bucketing & Engineering

Machine Learning with PySpark

Andrew Collier

Data Scientist, Fathom Data

Bucketing

Grouping continuous observations into buckets

Bucketing heights

A histogram of heights

+------+
|height|
+------+
|  1.42|
|  1.45|
|  1.47|
|  1.50|
|  1.52|
|  1.57|
|  1.60|
|  1.75|
|  1.85|
|  1.88|
+------+

Bucketing heights

A histogram of heights with ranges and labels

+------+----------+
|height|height_bin|
+------+----------+
|  1.42|     short|
|  1.45|     short|
|  1.47|     short|
|  1.50|     short|
|  1.52|   average|
|  1.57|   average|
|  1.60|   average|
|  1.75|   average|
|  1.85|      tall|
|  1.88|      tall|
+------+----------+

RPM histogram

Car RPM has "natural" breaks:

$\text{RPM} < 4500$: low
$\text{RPM} \geq 6000$: high
otherwise: medium.

A histogram of RPM with ranges and labels

RPM buckets

from pyspark.ml.feature import Bucketizer

bucketizer = Bucketizer(splits=[3500, 4500, 6000, 6500],
                        inputCol="rpm",
                        outputCol="rpm_bin")

Apply the buckets to the rpm column.

bucketed = bucketizer.transform(cars)

bucketed.select('rpm', 'rpm_bin').show(5)

+----+-------+
| rpm|rpm_bin|
+----+-------+
|3800|    0.0|
|4500|    1.0|
|5750|    1.0|
|5300|    1.0|
|6200|    2.0|
+----+-------+

bucketed.groupBy('rpm_bin').count().show()

+-------+-----+
|rpm_bin|count|
+-------+-----+
|    0.0|    8| <- low
|    1.0|   67| <- medium
|    2.0|   17| <- high
+-------+-----+
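
How a value is mapped to a bucket index can be sketched in plain Python. This is not the Spark implementation, only an illustration of the documented Bucketizer rule: with splits x, y a bucket holds values in [x, y), except the last bucket, which also includes its upper split. The helper name bucket_index is hypothetical.

```python
from bisect import bisect_right

# Same splits as passed to the Bucketizer above.
SPLITS = [3500, 4500, 6000, 6500]

def bucket_index(value, splits):
    """Return the bucket index (as a float, like rpm_bin) for one value.

    Hypothetical helper: buckets are left-closed, [x, y), and the last
    bucket also includes its upper bound.
    """
    if value < splits[0] or value > splits[-1]:
        # Assumption: out-of-range values are rejected here.
        raise ValueError(f"{value} lies outside the splits")
    if value == splits[-1]:
        return float(len(splits) - 2)
    return float(bisect_right(splits, value) - 1)

# The five rpm values shown in the output above.
for rpm in [3800, 4500, 5750, 5300, 6200]:
    print(rpm, bucket_index(rpm, SPLITS))
```

Note that a value exactly on a split, such as 4500, falls into the bucket that starts at that split.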

One-hot encoded RPM buckets

The RPM buckets are one-hot encoded to dummy variables.

+-------+-------------+
|rpm_bin|    rpm_dummy|
+-------+-------------+
|    0.0|(2,[0],[1.0])| <- low
|    1.0|(2,[1],[1.0])| <- medium
|    2.0|    (2,[],[])| <- high
+-------+-------------+

The 'high' RPM bucket is the reference level and doesn't get a dummy variable.
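
The sparse vectors in the table read as (size, [indices], [values]), so (2,[0],[1.0]) is a vector of length 2 with 1.0 at position 0. In PySpark the encoding would typically come from a one-hot encoder in pyspark.ml.feature. The plain-Python sketch below only reproduces the resulting vectors in dense form, assuming the last level is dropped as the reference. The function name one_hot_drop_last is hypothetical.

```python
def one_hot_drop_last(index, n_levels):
    """Dense dummy vector for a bucket index, with the last level as reference.

    Hypothetical helper: the last level gets no dummy variable, so it is
    encoded as all zeros.
    """
    i = int(index)
    if not 0 <= i < n_levels:
        raise ValueError(f"index {index} is not a valid level")
    dummy = [0.0] * (n_levels - 1)
    if i < n_levels - 1:
        dummy[i] = 1.0
    return dummy

# Three RPM buckets: low (0.0), medium (1.0), high (2.0).
for rpm_bin in [0.0, 1.0, 2.0]:
    print(rpm_bin, one_hot_drop_last(rpm_bin, 3))
```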

Model with bucketed RPM

regression.coefficients

DenseVector([1.3814, 0.1433])

+-------+-------------+
|rpm_bin|    rpm_dummy|
+-------+-------------+
|    0.0|(2,[0],[1.0])| <- low
|    1.0|(2,[1],[1.0])| <- medium
|    2.0|    (2,[],[])| <- high
+-------+-------------+

regression.intercept

8.1835

Consumption for 'low' RPM:

8.1835 + 1.3814 = 9.5649

Consumption for 'medium' RPM:

8.1835 + 0.1433 = 8.3268
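
The same arithmetic can be written as intercept plus the dot product of the coefficients with the dummy vector. The sketch below uses the numbers shown above and dense versions of the dummy vectors; the function name predict is hypothetical. It also covers the 'high' bucket, whose dummy vector is all zeros, so its prediction is the intercept alone.

```python
# Values copied from regression.intercept and regression.coefficients.
intercept = 8.1835
coefficients = [1.3814, 0.1433]

# Dense form of the one-hot encoded buckets ('high' is the reference level).
dummies = {
    "low": [1.0, 0.0],
    "medium": [0.0, 1.0],
    "high": [0.0, 0.0],
}

def predict(dummy):
    """Linear model prediction: intercept + coefficients . dummy."""
    return intercept + sum(c * x for c, x in zip(coefficients, dummy))

for level, dummy in dummies.items():
    print(level, round(predict(dummy), 4))
```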

More feature engineering

Operations on a single column:

log()
sqrt()
pow()

Operations on two columns:

product
ratio.

Mass & Height to BMI

A histogram of heights

A histogram of masses

A histogram of BMI

$\text{BMI} = \text{mass} / \text{height}^2$

+------+-----+----+
|height| mass| bmi|
+------+-----+----+
|  1.52| 77.1|33.2|
|  1.60| 58.1|22.7|
|  1.57|122.0|49.4|
|  1.75| 95.3|31.0|
|  1.80| 99.8|30.7|
|  1.65| 90.7|33.3|
|  1.60| 70.3|27.5|
|  1.78| 81.6|25.8|
|  1.65| 77.1|28.3|
|  1.78|128.0|40.5|
+------+-----+----+
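
A minimal sketch of the BMI calculation in plain Python, assuming mass is in kilograms and height in metres (the units are not stated above). The function name bmi is hypothetical. Values recomputed from the displayed columns can differ from the table in the last digit for some rows, possibly because the displayed heights are rounded.

```python
def bmi(mass_kg, height_m):
    """Body mass index: mass divided by height squared (assumed kg and m)."""
    return mass_kg / height_m ** 2

# (height, mass) pairs taken from three rows of the table above.
rows = [(1.60, 58.1), (1.65, 90.7), (1.65, 77.1)]

for height, mass in rows:
    print(height, mass, round(bmi(mass, height), 1))
```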

Engineering density

cars = cars.withColumn('density_line', cars.mass / cars.length)       # Linear density
cars = cars.withColumn('density_quad', cars.mass / cars.length**2)    # Area density
cars = cars.withColumn('density_cube', cars.mass / cars.length**3)    # Volume density

+------+------+------------+------------+------------+
|  mass|length|density_line|density_quad|density_cube|
+------+------+------------+------------+------------+
|1451.0| 4.775|303.87434554|63.638606397|13.327456837|
|1129.0| 4.623|244.21371403|52.825808790|11.426737787|
|1399.0| 4.547|307.67539036|67.665579583|14.881367843|
+------+------+------------+------------+------------+

Let's engineer some features!
