Regularization

Machine Learning with PySpark

Andrew Collier

Data Scientist, Fathom Data

Features: Only a few

Dataset with few features

Features: Too many

Dataset with many features

Features: Selected

Selecting features from a dataset with many features

Loss function (review)

Linear regression aims to minimize the MSE.

Mean Squared Error loss function
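
For reference, the usual definition of the MSE can be written as follows, with $y_i$ the observed values, $\hat{y}_i$ the model predictions and $N$ the number of observations (these symbol names are assumed here, not taken from the slide):

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$$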

Loss function with regularization

Linear regression aims to minimize the MSE.

Mean Squared Error loss function with a regularization term

Add a regularization term that depends on the coefficients.
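
One common way to write this down is sketched below. The notation is assumed here rather than taken from the slide: $y_i$ are the observed values, $\hat{y}_i$ the predictions, $N$ the number of observations, $\beta$ the vector of coefficients, $f$ the regularization term and $\lambda$ its weight.

$$L = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 + \lambda \, f(\beta)$$

Because $f(\beta)$ grows with the size of the coefficients, minimizing $L$ trades prediction error against coefficient size.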

Regularization term

An extra regularization term is added to the loss function.

The regularization term can be either

  • Lasso: absolute value of the coefficients
  • Ridge: square of the coefficients

It is also possible to have a blend of Lasso and Ridge regression.

The strength of the regularization is determined by the parameter $\lambda$:

  • $\lambda = 0$: no regularization (standard regression)
  • $\lambda = \infty$: complete regularization (all coefficients zero)
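
The three options above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the sample coefficients are made up, and the factor of 0.5 on the Ridge part of the blend follows the elastic net form given in the Spark MLlib documentation, where a mixing parameter alpha of 1 means pure Lasso and 0 means pure Ridge.

```python
def lasso_term(coefficients):
    """Lasso term: sum of the absolute values of the coefficients."""
    return sum(abs(c) for c in coefficients)


def ridge_term(coefficients):
    """Ridge term: sum of the squared coefficients."""
    return sum(c * c for c in coefficients)


def blended_term(coefficients, alpha):
    """Blend of Lasso and Ridge; alpha=1 is pure Lasso, alpha=0 is pure (halved) Ridge."""
    return alpha * lasso_term(coefficients) + (1 - alpha) * 0.5 * ridge_term(coefficients)


# Hypothetical coefficients, chosen only to make the arithmetic easy to follow
coefficients = [0.5, -2.0, 0.0]

print("Lasso :", lasso_term(coefficients))
print("Ridge :", ridge_term(coefficients))
print("Blend :", blended_term(coefficients, alpha=0.5))
```

Note how the large coefficient (-2.0) dominates the Ridge term much more than the Lasso term, because it is squared.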
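
To see the two extremes in action, here is a toy sketch with a single feature and no intercept. For the loss "MSE plus $\lambda$ times the squared slope", setting the derivative to zero gives the slope in closed form. The data and the function name are invented for illustration, and Spark's own solver (which also standardizes features by default) is not reproduced here.

```python
def ridge_slope(xs, ys, lam):
    """Slope w that minimizes mean((y - w*x)^2) + lam * w^2 for one feature."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # d/dw of the loss is zero when w * (sum_xx + n * lam) == sum_xy
    return sum_xy / (sum_xx + n * lam)


# Toy data lying exactly on the line y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

for lam in [0.0, 0.1, 1.0, 10.0, 1000.0]:
    print(f"lambda = {lam:>7}: slope = {ridge_slope(xs, ys, lam):.4f}")
```

With `lam = 0.0` the slope is the ordinary least-squares value, and it shrinks steadily toward zero as `lam` grows.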

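Cars again

from pyspark.ml.feature import VectorAssembler

# Combine the predictor columns into a single 'features' vector column
assembler = VectorAssembler(inputCols=[
    'mass', 'cyl', 'type_dummy', 'density_line', 'density_quad', 'density_cube'
], outputCol='features')
cars = assembler.transform(cars)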
+-----------------------------------------------------------------------------+-----------+
|features                                                                     |consumption|
+-----------------------------------------------------------------------------+-----------+
|[1451.0,6.0,1.0,0.0,0.0,0.0,0.0,303.8743455497,63.63860639785,13.32745683724]|9.05       |
|[1129.0,4.0,0.0,0.0,1.0,0.0,0.0,244.2137140385,52.82580879050,11.42673778726]|6.53       |
|[1399.0,4.0,0.0,0.0,1.0,0.0,0.0,307.6753903672,67.66557958374,14.88136784335]|7.84       |
|[1147.0,4.0,0.0,1.0,0.0,0.0,0.0,264.1031545014,60.81122599620,14.00212433714]|7.84       |
+-----------------------------------------------------------------------------+-----------+
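
Six input columns produce ten entries per feature vector. A plausible reading, assumed here rather than stated on the slide, is that `type_dummy` is a five-element one-hot vector that gets flattened into the result. The plain-Python sketch below mimics that concatenation for the first row of the table; the `assemble` helper is hypothetical and is not how Spark implements `VectorAssembler`.

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector-valued columns into one flat list of floats."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector-valued column: every element becomes its own feature
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features


input_cols = ['mass', 'cyl', 'type_dummy', 'density_line', 'density_quad', 'density_cube']

# First row of the table, with type_dummy assumed to be a 5-element one-hot vector
row = {
    'mass': 1451.0,
    'cyl': 6,
    'type_dummy': [1.0, 0.0, 0.0, 0.0, 0.0],
    'density_line': 303.8743455497,
    'density_quad': 63.63860639785,
    'density_cube': 13.32745683724,
}

features = assemble(row, input_cols)
print(len(features), features)
```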

Cars: Linear regression

Fit a (standard) Linear Regression model to the training data.

from pyspark.ml.regression import LinearRegression

regression = LinearRegression(labelCol='consumption').fit(cars_train)
# RMSE on testing data
0.708699086182001

Examine the coefficients:

regression.coefficients
DenseVector([-0.012, 0.174, -0.897, -1.445, -0.985, -1.071, -1.335, 0.189, -0.780, 1.160])
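
The slide reports an RMSE on the testing data without showing how it was obtained. A minimal sketch of one way to compute it follows. The `rmse` function shows what the metric is, using made-up numbers; `spark_test_rmse` is a hypothetical helper that uses PySpark's `RegressionEvaluator` and assumes a test DataFrame (called `cars_test` in the usage note, a name not given on the slide) with the same columns as `cars_train`.

```python
import math


def rmse(labels, predictions):
    """Root Mean Squared Error over paired labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def spark_test_rmse(model, test_df, label_col='consumption'):
    """RMSE of a fitted Spark regression model on a test DataFrame (needs pyspark)."""
    # Imported lazily so the pure-Python part above runs without Spark installed
    from pyspark.ml.evaluation import RegressionEvaluator

    predictions = model.transform(test_df)  # adds a 'prediction' column
    evaluator = RegressionEvaluator(
        labelCol=label_col, predictionCol='prediction', metricName='rmse'
    )
    return evaluator.evaluate(predictions)


# Made-up labels and predictions: both errors are 1.0 in size
example_rmse = rmse([2.0, 4.0], [1.0, 5.0])
print(example_rmse)
```

With a Spark session and a held-out split available, the call would look like `spark_test_rmse(regression, cars_test)`.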

Cars: Ridge regression

# alpha (elasticNetParam) = 0 | lambda (regParam) = 0.1 -> Ridge
ridge = LinearRegression(labelCol='consumption', elasticNetParam=0, regParam=0.1)
ridge = ridge.fit(cars_train)  # fit() returns the fitted model, so keep the result
# RMSE
0.724535609745491
# Ridge coefficients
DenseVector([ 0.001, 0.137, -0.395, -0.822, -0.450, -0.582, -0.806, 0.008,  0.029, 0.001])
# Linear Regression coefficients
DenseVector([-0.012, 0.174, -0.897, -1.445, -0.985, -1.071, -1.335, 0.189, -0.780, 1.160])
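
The shrinkage is easier to see when the two coefficient vectors are compared directly. The sketch below copies the values printed above into plain Python lists (the variable and function names are chosen here for illustration) and checks how many coefficients moved toward zero.

```python
import math

# Coefficients copied from the output above
linear_coefficients = [-0.012, 0.174, -0.897, -1.445, -0.985, -1.071, -1.335, 0.189, -0.780, 1.160]
ridge_coefficients = [0.001, 0.137, -0.395, -0.822, -0.450, -0.582, -0.806, 0.008, 0.029, 0.001]


def l2_norm(values):
    """Euclidean length of a coefficient vector."""
    return math.sqrt(sum(v * v for v in values))


# Count coefficients whose magnitude is smaller under Ridge
shrunk = sum(
    1 for r, l in zip(ridge_coefficients, linear_coefficients) if abs(r) < abs(l)
)

print("Coefficients shrunk by Ridge:", shrunk, "of", len(linear_coefficients))
print("L2 norm, linear:", round(l2_norm(linear_coefficients), 3))
print("L2 norm, ridge :", round(l2_norm(ridge_coefficients), 3))
```

Every coefficient is smaller in magnitude under Ridge, but none is exactly zero, and the RMSE rises only slightly (from about 0.709 to about 0.725).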

Cars: Lasso regression

# alpha (elasticNetParam) = 1 | lambda (regParam) = 0.1 -> Lasso
lasso = LinearRegression(labelCol='consumption', elasticNetParam=1, regParam=0.1)
lasso = lasso.fit(cars_train)  # fit() returns the fitted model, so keep the result
# RMSE
0.771988667026998
# Lasso coefficients
DenseVector([   0.0,   0.0,    0.0, -0.056,    0.0,    0.0,    0.0, 0.026,    0.0,   0.0])
# Ridge coefficients
DenseVector([ 0.001, 0.137, -0.395, -0.822, -0.450, -0.582, -0.806, 0.008,  0.029, 0.001])
# Linear Regression coefficients
DenseVector([-0.012, 0.174, -0.897, -1.445, -0.985, -1.071, -1.335, 0.189, -0.780, 1.160])
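
Lasso effectively performs feature selection: any feature whose coefficient is exactly zero drops out of the model. A small sketch, using the coefficients printed above copied into plain lists (variable names are chosen here for illustration), picks out the positions that survive.

```python
# Coefficients copied from the output above
lasso_coefficients = [0.0, 0.0, 0.0, -0.056, 0.0, 0.0, 0.0, 0.026, 0.0, 0.0]
ridge_coefficients = [0.001, 0.137, -0.395, -0.822, -0.450, -0.582, -0.806, 0.008, 0.029, 0.001]

# Positions (0-based) in the feature vector that Lasso keeps
selected = [i for i, c in enumerate(lasso_coefficients) if c != 0.0]

# Ridge shrinks coefficients but leaves all of them non-zero
ridge_nonzero = sum(1 for c in ridge_coefficients if c != 0.0)

print("Features kept by Lasso:", selected)
print("Non-zero Ridge coefficients:", ridge_nonzero)
```

Only two of the ten entries in the feature vector remain, at the cost of a higher RMSE (about 0.772 versus about 0.709 for the unregularized model).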

Regularization → simple model
