Supervised Learning with scikit-learn
George Boorman
Core Curriculum Manager
print(music_df[["duration_ms", "loudness", "speechiness"]].describe())
duration_ms loudness speechiness
count 1.000000e+03 1000.000000 1000.000000
mean 2.176493e+05 -8.284354 0.078642
std 1.137703e+05 5.065447 0.088291
min -1.000000e+00 -38.718000 0.023400
25% 1.831070e+05 -9.658500 0.033700
50% 2.176493e+05 -7.033500 0.045000
75% 2.564468e+05 -5.034000 0.078642
max 1.617333e+06 -0.883000 0.710000
Many models use (some form of) distance
Features on larger scales disproportionately influence the model
Example: KNN uses distance explicitly when making predictions
We want features on a similar scale
Normalizing or standardizing (scaling and centering)
Subtract the mean and divide by the standard deviation (standardization)
Or subtract the minimum and divide by the range (normalization to [0, 1])
Or scale so the data ranges from -1 to +1
See the scikit-learn docs for details
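The three scaling approaches above map directly onto scikit-learn transformers. A minimal sketch on toy data (the duration values here are made up for illustration, not taken from music_df):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

# Toy feature on a large scale, e.g. track duration in milliseconds
X = np.array([[180000.0], [210000.0], [240000.0], [300000.0]])

# Standardization: subtract the mean, divide by the standard deviation
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())  # approximately 0.0 and 1.0

# Min-max normalization: subtract the minimum, divide by the range -> [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)
print(X_minmax.min(), X_minmax.max())

# Divide by the maximum absolute value, mapping data into [-1, 1]
X_maxabs = MaxAbsScaler().fit_transform(X)
print(X_maxabs.max())
```

Each transformer learns its statistics in `fit` and applies them in `transform`, which is what makes them safe to fit on training data only.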
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

X = music_df.drop("genre", axis=1).values
y = music_df["genre"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(np.mean(X), np.std(X))
print(np.mean(X_train_scaled), np.std(X_train_scaled))
19801.42536120538, 71343.52910125865
2.260817795600319e-17, 1.0
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier(n_neighbors=6))]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = knn_scaled.predict(X_test)
print(knn_scaled.score(X_test, y_test))
0.81
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=21)
knn_unscaled = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)
print(knn_unscaled.score(X_test, y_test))
0.53
from sklearn.model_selection import GridSearchCV

steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
parameters = {"knn__n_neighbors": np.arange(1, 50)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
print(cv.best_score_)
0.8199999999999999
print(cv.best_params_)
{'knn__n_neighbors': 12}
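After the search, the fitted GridSearchCV object delegates `score` and `predict` to its best pipeline, refit on all of the training data, so the held-out test set can be scored directly. A self-contained sketch using a synthetic dataset as a stand-in for music_df (the data here is generated, so the scores will differ from the slides):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the music data
X, y = make_classification(n_samples=500, n_features=8, random_state=21)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21)

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("knn", KNeighborsClassifier())])
cv = GridSearchCV(pipeline, param_grid={"knn__n_neighbors": np.arange(1, 20)})
cv.fit(X_train, y_train)

# best_params_ holds the winning hyperparameters from cross-validation;
# score() evaluates the refit best estimator on unseen data
print(cv.best_params_)
print(cv.score(X_test, y_test))
```

Note that `best_score_` is the mean cross-validated score on the training folds, while `cv.score(X_test, y_test)` is the final check on data the search never saw.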