Supervised Learning with scikit-learn
George Boorman
Core Curriculum Manager
print(music_df[["duration_ms", "loudness", "speechiness"]].describe())
        duration_ms     loudness  speechiness
count  1.000000e+03  1000.000000  1000.000000
mean   2.176493e+05    -8.284354     0.078642
std    1.137703e+05     5.065447     0.088291
min   -1.000000e+00   -38.718000     0.023400
25%    1.831070e+05    -9.658500     0.033700
50%    2.176493e+05    -7.033500     0.045000
75%    2.564468e+05    -5.034000     0.078642
max    1.617333e+06    -0.883000     0.710000
Many models use some form of distance to make predictions
Features on larger scales can disproportionately influence the model
Example: KNN uses distance explicitly when making predictions
We want features to be on a similar scale
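To make the point about scale concrete, here is a minimal sketch with three made-up songs described by duration_ms and speechiness. The song values are hypothetical; the two standard deviations are taken from the describe() output above. Because only differences between songs matter for distance, dividing by the standard deviation is enough here and no centering is needed.

```python
import numpy as np

# Hypothetical songs: [duration_ms, speechiness]
song_a = np.array([200_000.0, 0.05])
song_b = np.array([201_000.0, 0.70])  # almost the same length, very different speechiness
song_c = np.array([230_000.0, 0.05])  # same speechiness, 30 seconds longer

# Raw Euclidean distances: duration_ms dominates because its values are huge
raw_ab = np.linalg.norm(song_a - song_b)
raw_ac = np.linalg.norm(song_a - song_c)

# Standard deviations of the two features, copied from the describe() output
feature_std = np.array([1.137703e+05, 0.088291])

# Distances after putting both features on a comparable scale
scaled_ab = np.linalg.norm((song_a - song_b) / feature_std)
scaled_ac = np.linalg.norm((song_a - song_c) / feature_std)

print(raw_ab < raw_ac)        # unscaled: song_b looks like the nearer neighbor
print(scaled_ab > scaled_ac)  # scaled: song_c is the nearer neighbor
```

On the raw values, the speechiness difference barely registers next to a difference of a few thousand milliseconds, so a distance-based model such as KNN would effectively ignore that feature.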
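Normalizing or standardizing (scaling and centering)
Standardization: subtract the mean and divide by the standard deviation, so every feature is centered on zero with a variance of one
Can also subtract the minimum and divide by the range, so every feature has a minimum of zero and a maximum of one
Can also normalize so the data ranges from -1 to +1
See the scikit-learn docs for further details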
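The three options listed above can be sketched with scikit-learn's StandardScaler and MinMaxScaler. The toy array X_toy is invented for illustration. Using MinMaxScaler with feature_range=(-1, 1) is one way to get the -1 to +1 range and is not necessarily the approach the course has in mind.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: [duration_ms, loudness]
X_toy = np.array([
    [180_000.0, -9.5],
    [220_000.0, -7.0],
    [260_000.0, -5.0],
    [300_000.0, -2.5],
])

# Standardization: subtract the mean, divide by the standard deviation
standardized = StandardScaler().fit_transform(X_toy)

# The same calculation written out by hand, column by column
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Min-max scaling: subtract the minimum, divide by the range (result in [0, 1])
min_max = MinMaxScaler().fit_transform(X_toy)

# Min-max scaling to a symmetric range (result in [-1, 1])
symmetric = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_toy)

print(standardized.mean(axis=0), standardized.std(axis=0))
print(min_max.min(axis=0), min_max.max(axis=0))
print(symmetric.min(axis=0), symmetric.max(axis=0))
```

Note that StandardScaler divides by the population standard deviation (ddof=0), which matches NumPy's default for std.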
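import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = music_df.drop("genre", axis=1).values
y = music_df["genre"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Mean and standard deviation before and after scaling
print(np.mean(X), np.std(X))
print(np.mean(X_train_scaled), np.std(X_train_scaled))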
19801.42536120538, 71343.52910125865
2.260817795600319e-17, 1.0
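The output above reports a single mean and standard deviation over the whole array. A per-feature check may be more informative. The sketch below uses synthetic stand-in data (music_df is not available here, so the column scales are assumptions) and also shows that the test set is transformed with the statistics learned from the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for two music features on very different scales
X_demo = np.column_stack([
    rng.normal(220_000, 110_000, size=500),  # duration-like column
    rng.normal(-8, 5, size=500),             # loudness-like column
])
y_demo = rng.integers(0, 2, size=500)        # arbitrary binary labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean_ and scale_ from the training set
X_te_scaled = scaler.transform(X_te)      # reuses those training statistics

# Each training column now has mean ~0 and standard deviation 1
print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))

# Test columns are close to, but not exactly, mean 0 and standard deviation 1
print(X_te_scaled.mean(axis=0), X_te_scaled.std(axis=0))
```

Calling fit_transform only on the training data keeps information about the test set out of the preprocessing step.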
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier(n_neighbors=6))]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = knn_scaled.predict(X_test)
print(knn_scaled.score(X_test, y_test))
0.81
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=21)
knn_unscaled = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)
print(knn_unscaled.score(X_test, y_test))
0.53
from sklearn.model_selection import GridSearchCV

steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
# "knn__n_neighbors" targets the n_neighbors parameter of the 'knn' step
parameters = {"knn__n_neighbors": np.arange(1, 50)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
print(cv.best_score_)
0.8199999999999999
print(cv.best_params_)
{'knn__n_neighbors': 12}
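Note that best_score_ is the mean cross-validated score on the training folds, not the test-set score. The following self-contained sketch shows how both could be read from a fitted GridSearchCV. It runs on synthetic data from make_classification, so the dataset, the smaller grid (1 to 20 neighbors), and the variable names are all assumptions for illustration, and the numbers it prints will differ from those above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 samples, 5 features, 2 classes
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21
)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameter names follow the pattern <step name>__<parameter name>
param_grid = {"knn__n_neighbors": np.arange(1, 21)}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
cv_score = search.best_score_          # mean cross-validated accuracy on training folds
test_score = search.score(X_te, y_te)  # accuracy of the refitted best pipeline on the test set

print(best_k, cv_score, test_score)
```

Because the scaler sits inside the pipeline, it is refitted on the training portion of each cross-validation split, so the held-out fold never influences the scaling statistics.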