Machine Learning with PySpark
Andrew Collier
Data Scientist, Fathom Data
Classify cars according to country of manufacture.
+---+----+------+------+----+-----------+----------------------------------+-----+
|cyl|size|mass |length|rpm |consumption|features |label|
+---+----+------+------+----+-----------+----------------------------------+-----+
|6 |3.0 |1451.0|4.775 |5200|9.05 |[6.0,3.0,1451.0,4.775,5200.0,9.05]|1.0 |
|4 |2.2 |1129.0|4.623 |5200|6.53 |[4.0,2.2,1129.0,4.623,5200.0,6.53]|0.0 |
|4 |2.2 |1399.0|4.547 |5600|7.84 |[4.0,2.2,1399.0,4.547,5600.0,7.84]|1.0 |
|4 |1.8 |1147.0|4.343 |6500|7.84 |[4.0,1.8,1147.0,4.343,6500.0,7.84]|0.0 |
|4 |1.6 |1111.0|4.216 |5750|9.05 |[4.0,1.6,1111.0,4.216,5750.0,9.05]|0.0 |
+---+----+------+------+----+-----------+----------------------------------+-----+
label = 0 -> manufactured in the USA
= 1 -> manufactured elsewhere
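The features column collects the six predictor columns into a single vector, as Spark ML models require. A minimal sketch of how such a column can be built with VectorAssembler (the assembler call itself is not shown on this slide; the column names are taken from the table above):

from pyspark.ml.feature import VectorAssembler

# Combine the predictor columns into a single vector column called 'features'
assembler = VectorAssembler(
    inputCols=['cyl', 'size', 'mass', 'length', 'rpm', 'consumption'],
    outputCol='features'
)
cars = assembler.transform(cars)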
Split data into training and testing sets.
# Specify a seed for reproducibility
cars_train, cars_test = cars.randomSplit([0.8, 0.2], seed=23)
Two DataFrames: cars_train and cars_test.
[cars_train.count(), cars_test.count()]
[79, 13]
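The requested proportions are only approximate: the training set holds 79 of the 92 records, a fraction of roughly 0.86 rather than exactly 0.80. A quick check (assumed, not shown on the slide):

# Fraction of records that ended up in the training set
training_fraction = cars_train.count() / cars.count()
print(training_fraction)  # 79 / 92 ≈ 0.86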
from pyspark.ml.classification import DecisionTreeClassifier
Create a Decision Tree classifier.
tree = DecisionTreeClassifier()
Learn from the training data.
tree_model = tree.fit(cars_train)
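The fitted model can optionally be inspected; for example, a text representation of the learned tree is available via toDebugString (an extra step, not shown on the slide):

# Print the structure of the fitted Decision Tree
print(tree_model.toDebugString)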
Make predictions on the testing data and compare to known values.
prediction = tree_model.transform(cars_test)
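The table below shows a few of the resulting predictions; a plausible way to produce it (the exact statement is not shown on the slide) is to select the relevant columns:

prediction.select('label', 'prediction', 'probability').show(5, truncate=False)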
+-----+----------+---------------------------------------+
|label|prediction|probability |
+-----+----------+---------------------------------------+
|1.0 |0.0 |[0.9615384615384616,0.0384615384615385]|
|1.0 |1.0 |[0.2222222222222222,0.7777777777777778]|
|1.0 |1.0 |[0.2222222222222222,0.7777777777777778]|
|0.0 |0.0 |[0.9615384615384616,0.0384615384615385]|
|1.0 |1.0 |[0.2222222222222222,0.7777777777777778]|
+-----+----------+---------------------------------------+
A confusion matrix is a table that describes the performance of a model on the testing data.
prediction.groupBy("label", "prediction").count().show()
+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
| 1.0| 1.0| 8| <- True positive (TP)
| 0.0| 1.0| 2| <- False positive (FP)
| 1.0| 0.0| 3| <- False negative (FN)
| 0.0| 0.0| 6| <- True negative (TN)
+-----+----------+-----+
Accuracy = (TN + TP) / (TN + TP + FN + FP), the proportion of correct predictions. For the counts above: (6 + 8) / (6 + 8 + 3 + 2) = 14 / 19 ≈ 0.74.
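A minimal sketch of computing accuracy directly from the predictions, mirroring the formula above (the filter conditions assume the label and prediction columns shown earlier):

# Count each cell of the confusion matrix
TP = prediction.filter('prediction = 1.0 AND label = 1.0').count()
FP = prediction.filter('prediction = 1.0 AND label = 0.0').count()
FN = prediction.filter('prediction = 0.0 AND label = 1.0').count()
TN = prediction.filter('prediction = 0.0 AND label = 0.0').count()

# Proportion of correct predictions
accuracy = (TN + TP) / (TN + TP + FN + FP)
print(accuracy)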