Supervised Learning with scikit-learn
George Boorman
Core Curriculum Manager, DataCamp
No value for a feature in a particular row
This can occur because:
We need to deal with missing data
print(music_df.isna().sum().sort_values())
genre 8
popularity 31
loudness 44
liveness 46
tempo 46
speechiness 59
duration_ms 91
instrumentalness 91
danceability 143
valence 143
acousticness 200
energy 200
dtype: int64
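The counting idiom above can be tried on a small frame. This is a minimal sketch: `toy_df` and its values are made up for illustration and are not the music dataset from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the music dataset used in the slides
toy_df = pd.DataFrame({
    "tempo": [120.0, np.nan, 98.5, 140.2],
    "energy": [0.8, np.nan, np.nan, 0.4],
    "genre": ["Rock", "Jazz", None, "Rock"],
})

# isna() marks each missing cell as True; sum() counts the Trues per column
missing_counts = toy_df.isna().sum().sort_values()
print(missing_counts)

# mean() of the same boolean frame gives the share of rows missing per column
missing_share = toy_df.isna().mean()
print(missing_share)
```

Looking at the share rather than the raw count makes it easier to compare columns when deciding between dropping rows and imputing.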
music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])
print(music_df.isna().sum().sort_values())
popularity 0
liveness 0
loudness 0
tempo 0
genre 0
duration_ms 29
instrumentalness 29
speechiness 53
danceability 127
valence 127
acousticness 178
energy 178
dtype: int64
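The slide hard-codes the columns passed to `dropna`. One way to pick them programmatically is sketched below. The 5% cutoff is an assumption (a common rule of thumb, not something stated on this slide), and `toy_df` is hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 rows, two numeric columns
rng = np.random.default_rng(0)
n_rows = 200
toy_df = pd.DataFrame({
    "loudness": rng.normal(size=n_rows),
    "energy": rng.normal(size=n_rows),
})
toy_df.loc[:3, "loudness"] = np.nan    # .loc slices are inclusive: 4 rows, 2% missing
toy_df.loc[10:59, "energy"] = np.nan   # 50 rows, 25% missing

# Assumed cutoff: drop rows only for columns missing in under 5% of rows
threshold = 0.05
missing_share = toy_df.isna().mean()
sparse_cols = missing_share[(missing_share > 0) & (missing_share < threshold)].index.tolist()

# Rows with a missing "loudness" are dropped; "energy" is left for imputation
cleaned_df = toy_df.dropna(subset=sparse_cols)
print(sparse_cols, len(cleaned_df))
print(cleaned_df.isna().sum())
```

Columns above the cutoff keep their missing values, which is the situation the imputation code that follows is meant to handle.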
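import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Separate the categorical feature, the numeric features, and the target
X_cat = music_df["genre"].values.reshape(-1, 1)
X_num = music_df.drop(["genre", "popularity"], axis=1).values
y = music_df["popularity"].values

# Same test_size and random_state in both calls, so the rows line up across the two splits
X_train_cat, X_test_cat, y_train, y_test = train_test_split(X_cat, y, test_size=0.2, random_state=12)
X_train_num, X_test_num, y_train, y_test = train_test_split(X_num, y, test_size=0.2, random_state=12)

# Categorical feature: fill with the most frequent value, learned from the training set only
imp_cat = SimpleImputer(strategy="most_frequent")
X_train_cat = imp_cat.fit_transform(X_train_cat)
X_test_cat = imp_cat.transform(X_test_cat)

# Numeric features: SimpleImputer defaults to the column mean
imp_num = SimpleImputer()
X_train_num = imp_num.fit_transform(X_train_num)
X_test_num = imp_num.transform(X_test_num)

# Recombine numeric and categorical features column-wise
X_train = np.append(X_train_num, X_train_cat, axis=1)
X_test = np.append(X_test_num, X_test_cat, axis=1)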
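To see why the imputers are fitted on the training split and only applied to the test split, here is a small sketch on hand-made arrays. The array values are hypothetical; only the `SimpleImputer` calls mirror the slide.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric features with gaps
X_train_num = np.array([[1.0, 10.0],
                        [np.nan, 20.0],
                        [3.0, np.nan]])
X_test_num = np.array([[np.nan, np.nan]])

# Default strategy is "mean": column means are computed from the training rows
imp_num = SimpleImputer()
X_train_filled = imp_num.fit_transform(X_train_num)
# transform() reuses the training means, so no test information leaks in
X_test_filled = imp_num.transform(X_test_num)
print(imp_num.statistics_)
print(X_test_filled)

# Hypothetical categorical feature stored as an object array
X_train_cat = np.array([["Rock"], ["Jazz"], ["Rock"], [np.nan]], dtype=object)
imp_cat = SimpleImputer(strategy="most_frequent")
X_train_cat_filled = imp_cat.fit_transform(X_train_cat)
print(X_train_cat_filled.ravel())
```

The fitted values are stored in `statistics_`, which makes it easy to check what will be written into the gaps before transforming new data.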
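import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])
# Binary target: 1 for Rock, 0 for every other genre
music_df["genre"] = np.where(music_df["genre"] == "Rock", 1, 0)
X = music_df.drop("genre", axis=1).values
y = music_df["genre"].values

# Each step is a (name, estimator) tuple; every step except the last must be a transformer
steps = [("imputation", SimpleImputer()),
         ("logistic_regression", LogisticRegression())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# fit() imputes using training statistics, then trains the model on the filled data
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)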
0.7593582887700535
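The same pipeline pattern can be run end to end without the music dataset. The sketch below uses synthetic data from `make_classification` with roughly 10% of cells blanked out; the data, the missing rate, and the seeds are assumptions for illustration, so the score will not match the one above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the music features (hypothetical data)
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

# Blank out roughly 10% of the cells at random
rng = np.random.default_rng(42)
mask = rng.random(X.shape) < 0.1
X[mask] = np.nan

steps = [("imputation", SimpleImputer()),
         ("logistic_regression", LogisticRegression())]
pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The imputer is fitted on X_train only; score() applies it to X_test before predicting
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(score)

# The fitted imputer is reachable by its step name
print(pipeline.named_steps["imputation"].statistics_)
```

Keeping the imputer inside the pipeline means a single `fit` call handles both steps, and the test split is never used to compute fill values.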