Regression

Machine Learning with PySpark

Andrew Collier

Data Scientist, Fathom Data

Consumption versus mass: scatter

Scatter plot of fuel consumption versus mass

Consumption versus mass: fit

Scatter plot of fuel consumption versus mass with linear fit

Consumption versus mass: alternative fits

Scatter plot of fuel consumption versus mass with linear fit and alternatives

Consumption versus mass: residuals

Scatter plot of fuel consumption versus mass with linear fit and residuals
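
As a rough illustration of what the fit and residual plots represent, the sketch below fits a straight line by closed-form least squares in plain Python and computes each residual (observed minus fitted). The five mass and consumption pairs are copied from the predictor table later in these slides; the helper name `fit_line` is hypothetical, and this is not how Spark fits the model internally.

```python
# Illustrative data: mass and consumption values copied from the
# predictor table shown later in these slides.
mass = [1451.0, 1129.0, 1399.0, 1147.0, 1111.0]
consumption = [9.05, 6.53, 7.84, 7.84, 9.05]


def fit_line(x, y):
    """Closed-form least squares for y = intercept + slope * x (hypothetical helper)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_line(mass, consumption)

# A residual is the vertical distance between an observation and the line.
fitted = [intercept + slope * m for m in mass]
residuals = [c - f for c, f in zip(consumption, fitted)]
```

For a least-squares line with an intercept, the residuals sum to zero (up to floating-point error), which is a quick sanity check on the fit.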

Loss function

$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$

MSE = "Mean Squared Error"

$y_i$: observed values

$\hat{y}_i$: model values

$\frac{1}{N} \sum_{i=1}^{N}$: mean over the $N$ observations
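
The mean squared error averages the squared differences between observed and model values. A minimal plain-Python sketch follows; the function name and the three data points are illustrative assumptions, not taken from the cars data.

```python
def mean_squared_error(observed, predicted):
    """Average of squared differences between observed and model values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Made-up values for illustration only.
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# Squared errors are 1, 0 and 4, so the mean is 5 / 3.
mse = mean_squared_error(observed, predicted)
```

Squaring makes every error positive and penalises large errors more heavily than small ones.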

Assemble predictors

Predict consumption using mass, cyl and type_dummy.

Consolidate predictors into a single column.

+------+---+-------------+----------------------------+-----------+
|mass  |cyl|type_dummy   |features                    |consumption|
+------+---+-------------+----------------------------+-----------+
|1451.0|6  |(5,[0],[1.0])|(7,[0,1,2],[1451.0,6.0,1.0])|9.05       |
|1129.0|4  |(5,[2],[1.0])|(7,[0,1,4],[1129.0,4.0,1.0])|6.53       |
|1399.0|4  |(5,[2],[1.0])|(7,[0,1,4],[1399.0,4.0,1.0])|7.84       |
|1147.0|4  |(5,[1],[1.0])|(7,[0,1,3],[1147.0,4.0,1.0])|7.84       |
|1111.0|4  |(5,[3],[1.0])|(7,[0,1,5],[1111.0,4.0,1.0])|9.05       |
+------+---+-------------+----------------------------+-----------+
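
The `type_dummy` and `features` columns above are printed in sparse vector notation: (size, indices, values). The plain-Python sketch below expands that notation into a dense list to make the layout explicit; `sparse_to_dense` is a hypothetical helper, not a Spark function. In Spark ML this kind of consolidation is typically done with `VectorAssembler`, although the slide does not show that step.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a dense list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# First row of the table: features = (7,[0,1,2],[1451.0,6.0,1.0])
features = sparse_to_dense(7, [0, 1, 2], [1451.0, 6.0, 1.0])
# features == [1451.0, 6.0, 1.0, 0.0, 0.0, 0.0, 0.0]

# Same row: type_dummy = (5,[0],[1.0])
type_dummy = sparse_to_dense(5, [0], [1.0])
# type_dummy == [1.0, 0.0, 0.0, 0.0, 0.0]
```

Slot 0 holds `mass`, slot 1 holds `cyl`, and slots 2 to 6 hold the five `type_dummy` entries, which is why a dummy index of 0 appears at index 2 of `features`.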

Build regression model

from pyspark.ml.regression import LinearRegression

regression = LinearRegression(labelCol='consumption')

Fit to cars_train (training data).

regression = regression.fit(cars_train)

Predict on cars_test (testing data).

predictions = regression.transform(cars_test)

Examine predictions

+-----------+------------------+
|consumption|prediction        |
+-----------+------------------+
|7.84       |8.92699470743403  |
|9.41       |9.379295891451353 |
|8.11       |7.23487264538364  |
|9.05       |9.409860194333735 |
|7.84       |7.059190923328711 |
|7.84       |7.785909738591766 |
|7.59       |8.129959405168547 |
|5.11       |6.836843743852942 |
|8.11       |7.17173702652015  |
+-----------+------------------+

Scatter plot of predictions vs actuals

Calculate RMSE

from pyspark.ml.evaluation import RegressionEvaluator

# Find RMSE (Root Mean Squared Error)
RegressionEvaluator(labelCol='consumption').evaluate(predictions)
0.708699086182001

A RegressionEvaluator can also calculate the following metrics:

  • mae (Mean Absolute Error)
  • r2 ($R^2$)
  • mse (Mean Squared Error).
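
To show how these metrics relate to one another, the sketch below computes them in plain Python from the nine rows of the predictions table shown earlier. Because it uses only those nine rows, the results will not match the RMSE reported above for the full test set. In PySpark itself the metric is selected through the evaluator's `metricName` parameter (for example `'mae'` or `'r2'`).

```python
import math

# The nine (consumption, prediction) rows from the predictions table.
observed = [7.84, 9.41, 8.11, 9.05, 7.84, 7.84, 7.59, 5.11, 8.11]
predicted = [
    8.92699470743403, 9.379295891451353, 7.23487264538364,
    9.409860194333735, 7.059190923328711, 7.785909738591766,
    8.129959405168547, 6.836843743852942, 7.17173702652015,
]

errors = [y - y_hat for y, y_hat in zip(observed, predicted)]
n = len(errors)

mse = sum(e ** 2 for e in errors) / n      # Mean Squared Error
rmse = math.sqrt(mse)                      # Root Mean Squared Error
mae = sum(abs(e) for e in errors) / n      # Mean Absolute Error

# R^2 compares the model's squared error with that of always
# predicting the mean of the observed values.
y_mean = sum(observed) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((y - y_mean) ** 2 for y in observed)
r2 = 1 - ss_res / ss_tot
```

RMSE and MAE are in the same units as consumption, whereas $R^2$ is unitless and closer to 1 is better.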

Consumption versus mass: intercept

Plot showing model intercept

Examine intercept

regression.intercept
4.9450616833727095

This is the fuel consumption in the (hypothetical) case that:

  • mass = 0
  • cyl = 0 and
  • vehicle type is 'Van' (the reference level, which has no dummy variable of its own).

Consumption versus mass: slope

Plot showing model slope

Examine coefficients

regression.coefficients
DenseVector([0.0027, 0.1897, -1.309, -1.7933, -1.3594, -1.2917, -1.9693])
mass        0.0027
cyl         0.1897

Midsize    -1.3090
Small      -1.7933
Compact    -1.3594
Sporty     -1.2917
Large      -1.9693
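
A prediction from a linear model is the intercept plus the sum of each coefficient times its feature value. The plain-Python sketch below applies the intercept and coefficients shown in these slides. It assumes the feature order is mass, cyl, then the five type dummies in the order listed above, and the coefficients are rounded to four decimals, so the results are approximate. `predict` is a hypothetical helper, not part of the Spark API.

```python
# Values copied from the slides (coefficients are rounded to 4 decimals).
intercept = 4.9450616833727095
coefficients = [0.0027, 0.1897, -1.309, -1.7933, -1.3594, -1.2917, -1.9693]


def predict(features):
    """Intercept plus the dot product of coefficients and features."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# All features zero: mass = 0, cyl = 0, no dummy set ('Van').
baseline = predict([0.0] * 7)

# Assumed example: 1451 kg, 6 cylinders, first dummy set ('Midsize').
midsize = predict([1451.0, 6.0, 1.0, 0.0, 0.0, 0.0, 0.0])

# Effect of an extra 100 kg of mass, holding everything else fixed.
per_100kg = coefficients[0] * 100
```

Here `baseline` reproduces the intercept, `midsize` comes out at roughly 8.69, and `per_100kg` is about 0.27. Every type coefficient is negative, meaning each listed type is predicted to consume less than a 'Van' of the same mass and cylinder count.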

Regression for numeric predictions
