Supervised Learning with scikit-learn
George Boorman
Core Curriculum Manager
print(music_df[["duration_ms", "loudness", "speechiness"]].describe())
        duration_ms     loudness  speechiness
count  1.000000e+03  1000.000000  1000.000000
mean   2.176493e+05    -8.284354     0.078642
std    1.137703e+05     5.065447     0.088291
min   -1.000000e+00   -38.718000     0.023400
25%    1.831070e+05    -9.658500     0.033700
50%    2.176493e+05    -7.033500     0.045000
75%    2.564468e+05    -5.034000     0.078642
max    1.617333e+06    -0.883000     0.710000
Many models use some form of distance to inform their predictions.
Features on larger scales can disproportionately influence the model.
Example: KNN uses distance explicitly when making predictions.
We want features to be on a similar scale.
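To make the scale effect concrete, here is a minimal sketch using only the Python standard library. The three songs and their feature values are made up for illustration (they are not rows from music_df); the standard deviations passed to the hypothetical helper rescale are the ones printed by describe() above.

```python
import math

# Hypothetical songs as (duration_ms, speechiness); values are illustrative only.
song_a = (200_000.0, 0.03)
song_b = (201_000.0, 0.70)   # nearly the same length, very different speechiness
song_c = (230_000.0, 0.03)   # identical speechiness, 30 seconds longer

# Raw Euclidean distances: duration_ms dominates because its units are huge.
dist_ab = math.dist(song_a, song_b)
dist_ac = math.dist(song_a, song_c)
print(dist_ab < dist_ac)  # True: song_b looks "closer" despite the speechiness gap


def rescale(song, stds=(1.137703e+05, 0.088291)):
    """Divide each feature by its standard deviation (from the describe() output)."""
    return tuple(value / std for value, std in zip(song, stds))


# With both features on a comparable scale, the ordering flips.
scaled_ab = math.dist(rescale(song_a), rescale(song_b))
scaled_ac = math.dist(rescale(song_a), rescale(song_c))
print(scaled_ab > scaled_ac)  # True: the speechiness difference now counts
```

A distance-based model such as KNN would pick different neighbors for song_a in the two cases, which is why the choice of scale matters.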
Normalizing or standardizing (scaling and centering)
Standardization: subtract the mean and divide by the standard deviation, so every feature is centered at zero with unit variance.
You can also subtract the minimum and divide by the range, so the data has a minimum of 0 and a maximum of 1.
You can also normalize so that the data ranges from -1 to +1.
See the scikit-learn documentation for further details.
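The rescalings listed above can be written out by hand. This is a sketch using only the standard library; the sample values and the helper names (standardize, min_max, min_max_symmetric) are illustrative and are not scikit-learn functions. In scikit-learn, StandardScaler and MinMaxScaler (with its feature_range parameter) cover the same ground.

```python
import statistics

values = [2.0, 4.0, 4.0, 6.0, 9.0]  # made-up sample of a single feature


def standardize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def min_max(xs):
    """Subtract the minimum and divide by the range, giving values in [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def min_max_symmetric(xs):
    """One way to reach [-1, 1]: stretch the [0, 1] result and shift it."""
    return [2.0 * v - 1.0 for v in min_max(xs)]


z = standardize(values)
print(statistics.fmean(z), statistics.pstdev(z))  # mean near 0, spread near 1
print(min_max(values))
print(min_max_symmetric(values))
```

The population standard deviation is used here because StandardScaler does the same; with a sample standard deviation the scaled values would differ slightly.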
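import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# music_df is assumed to be the DataFrame loaded earlier in the course
X = music_df.drop("genre", axis=1).values
y = music_df["genre"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
# Fit the scaler on the training data only, then reuse it on the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(np.mean(X), np.std(X))
print(np.mean(X_train_scaled), np.std(X_train_scaled))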
19801.42536120538, 71343.52910125865
2.260817795600319e-17, 1.0
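from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier(n_neighbors=6))]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
# Scaling now happens inside the pipeline on every fit and predict
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = knn_scaled.predict(X_test)
print(knn_scaled.score(X_test, y_test))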
0.81
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=21)
knn_unscaled = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)
print(knn_unscaled.score(X_test, y_test))
0.53
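The gap between the scaled and unscaled scores can be reproduced on synthetic data. The sketch below does not use music_df; it builds a made-up two-feature dataset (one informative feature on a small scale, one pure-noise feature on a large scale) so that it runs on its own. The exact accuracies depend on the random seed, so none are quoted here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 1000

# Synthetic stand-in data (an assumption, not the music dataset).
y = rng.integers(0, 2, size=n_samples)
informative = y + rng.normal(0.0, 0.3, size=n_samples)  # small scale, carries the signal
noise = rng.normal(0.0, 1000.0, size=n_samples)         # large scale, carries nothing
X = np.column_stack([informative, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21, stratify=y
)

# KNN on raw features: distances are dominated by the noisy, large-scale column.
knn_unscaled = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)
unscaled_score = knn_unscaled.score(X_test, y_test)

# The same model inside a pipeline that standardizes first.
knn_scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=6)),
]).fit(X_train, y_train)
scaled_score = knn_scaled.score(X_test, y_test)

print(f"unscaled: {unscaled_score:.2f}, scaled: {scaled_score:.2f}")
```

On data built this way the scaled pipeline should score clearly higher, because standardizing stops the noise column from dominating the neighbor search.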
from sklearn.model_selection import GridSearchCV

steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
# "knn__n_neighbors" targets the n_neighbors parameter of the 'knn' step
parameters = {"knn__n_neighbors": np.arange(1, 50)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
print(cv.best_score_)
0.8199999999999999
print(cv.best_params_)
{'knn__n_neighbors': 12}
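Note that best_score_ is the mean cross-validated score of the best setting on the training data, not a test-set score. A hedged sketch of reading both from a fitted search follows, again on synthetic data from make_classification rather than music_df; the smaller grid and cv=5 are choices made here to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the music dataset).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21
)

pipeline = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# "<step name>__<parameter>" routes the grid values to the right pipeline step.
parameters = {"knn__n_neighbors": np.arange(1, 20)}
cv = GridSearchCV(pipeline, param_grid=parameters, cv=5)
cv.fit(X_train, y_train)

print(cv.best_params_)           # best n_neighbors found by cross-validation
print(cv.best_score_)            # mean cross-validated accuracy for that setting
print(cv.score(X_test, y_test))  # accuracy of the refit best pipeline on held-out data
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every fold, so the validation folds never influence the scaling statistics.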