Supervised Learning with scikit-learn
George Boorman
Core Curriculum Manager, DataCamp
scikit-learn will not accept categorical features by default
Need to convert categorical features into numeric values
Convert to binary features called dummy variables
0: Observation was NOT that category
1: Observation was that category
scikit-learn: OneHotEncoder()
pandas: get_dummies()
popularity: Target variable
genre: Categorical feature
print(music_df.head())
   popularity  acousticness  danceability  ...       tempo  valence       genre
0        41.0        0.6440         0.823  ...  102.619000    0.649        Jazz
1        62.0        0.0855         0.686  ...  173.915000    0.636         Rap
2        42.0        0.2390         0.669  ...  145.061000    0.494  Electronic
3        64.0        0.0125         0.522  ...  120.406497    0.595        Rock
4        60.0        0.1210         0.780  ...   96.056000    0.312         Rap
import pandas as pd
music_df = pd.read_csv('music.csv')
music_dummies = pd.get_dummies(music_df["genre"], drop_first=True)
print(music_dummies.head())
   Anime  Blues  Classical  Country  Electronic  Hip-Hop  Jazz  Rap  Rock
0      0      0          0        0           0        0     1    0     0
1      0      0          0        0           0        0     0    1     0
2      0      0          0        0           1        0     0    0     0
3      0      0          0        0           0        0     0    0     1
4      0      0          0        0           0        0     0    1     0
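With drop_first=True, the category that sorts first gets no column of its own and is instead represented by a row of all zeros. The sketch below shows the effect on a hypothetical three-genre toy column (not the music dataset).

```python
import pandas as pd

# Hypothetical toy data for illustration; not taken from music.csv.
toy = pd.DataFrame({"genre": ["Jazz", "Rap", "Rock", "Rap"]})

# Full encoding: one column per category
full = pd.get_dummies(toy["genre"], dtype=int)

# Reduced encoding: the first category (Jazz) is dropped
reduced = pd.get_dummies(toy["genre"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced)  # the Jazz row is all zeros
```

Dropping one column loses no information, because membership in the dropped category is implied when every remaining dummy is 0.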
music_dummies = pd.concat([music_df, music_dummies], axis=1)
music_dummies = music_dummies.drop("genre", axis=1)
music_dummies = pd.get_dummies(music_df, drop_first=True)
print(music_dummies.columns)
Index(['popularity', 'acousticness', 'danceability', 'duration_ms', 'energy',
       'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo',
       'valence', 'genre_Anime', 'genre_Blues', 'genre_Classical',
       'genre_Country', 'genre_Electronic', 'genre_Hip-Hop', 'genre_Jazz',
       'genre_Rap', 'genre_Rock'],
      dtype='object')
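When the whole DataFrame is passed to get_dummies(), numeric columns are left untouched, the categorical column is replaced automatically, and the new columns are prefixed with the original column name. A minimal sketch on a hypothetical three-row frame (values are illustrative, not the music dataset):

```python
import pandas as pd

# Hypothetical toy data for illustration; not taken from music.csv.
toy = pd.DataFrame({
    "popularity": [41.0, 62.0, 64.0],
    "tempo": [102.6, 173.9, 120.4],
    "genre": ["Jazz", "Rap", "Rock"],
})

# Only the string column is encoded; no concat or drop step is needed
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

This is why the one-line version replaces the separate concat and drop steps used when only the genre column is encoded.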
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.linear_model import LinearRegression
# Separate the features from the target
X = music_dummies.drop("popularity", axis=1).values
y = music_dummies["popularity"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
linreg = LinearRegression()
linreg_cv = cross_val_score(linreg, X_train, y_train, cv=kf, scoring="neg_mean_squared_error")
print(np.sqrt(-linreg_cv))
[8.15792932, 8.63117538, 7.52275279, 8.6205778, 7.91329988]
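The full workflow can be tried without music.csv. The sketch below is an assumption-laden stand-in: it generates a hypothetical synthetic dataset (made-up genre effects, tempo range, and noise level), then applies the same encode, split, and cross-validate steps. scikit-learn scorers follow a "higher is better" convention, so MSE is returned negated and must be flipped before taking the square root.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Hypothetical synthetic data standing in for music.csv
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "tempo": rng.uniform(60, 180, size=n),
    "genre": rng.choice(["Jazz", "Rap", "Rock"], size=n),
})
# Made-up relationship: each genre shifts popularity by a fixed amount
genre_effect = toy["genre"].map({"Jazz": 40.0, "Rap": 60.0, "Rock": 55.0})
toy["popularity"] = genre_effect + 0.05 * toy["tempo"] + rng.normal(0, 5, size=n)

# Encode the categorical feature, then split features and target
toy_dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
X = toy_dummies.drop("popularity", axis=1).values
y = toy_dummies["popularity"].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5-fold cross-validation on the training set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(
    LinearRegression(), X_train, y_train, cv=kf,
    scoring="neg_mean_squared_error",
)

# Negate to recover MSE, then take the square root for per-fold RMSE
rmse = np.sqrt(-neg_mse)
print(rmse)
```

The per-fold RMSE is in the units of the target, so it can be read directly as a typical prediction error in popularity points.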