Feature engineering and overfitting

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

Feature extraction from non-tabular data

A single patient's ECG is typically recorded on a piece of paper as a time-series.

arrhythmias.head()
   age  sex  height  weight  ...    chV6_TwaveAmp  chV6_QRSA  chV6_QRSTA  class
0   75    0     190      80  ...              2.9       23.3        49.4      0
1   56    1     165      64  ...              2.1       20.4        38.8      0
2   54    0     172      95  ...              3.4       12.3        49.0      0
3   55    0     175      94  ...              2.6       34.6        61.6      1
4   75    0     190      80  ...              3.9       25.4        62.8      0

Label encoding for categorical variables

import numpy
from sklearn.preprocessing import LabelEncoder

numpy.unique(credit_scoring['purpose'])
array(['business', 'buy_domestic_appliance', 'buy_furniture_equipment',
       'buy_new_car', 'buy_radio_tv', 'buy_used_car', 'education',
       'other', 'repairs', 'retraining'], dtype=object)
numpy.unique(LabelEncoder().fit_transform(credit_scoring['purpose']))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Label encoding for categorical variables

A decision tree splitting the credit purpose variable into arbitrary numeric intervals, where the numbers are merely an encoding of the underlying categories.
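To make the problem concrete, here is a minimal sketch of what a threshold split on label-encoded values does. The threshold of 4.5 is a hypothetical value chosen for illustration, not one taken from a fitted tree.

```python
from sklearn.preprocessing import LabelEncoder

# The ten purpose categories from the credit scoring data.
purposes = [
    'business', 'buy_domestic_appliance', 'buy_furniture_equipment',
    'buy_new_car', 'buy_radio_tv', 'buy_used_car', 'education',
    'other', 'repairs', 'retraining',
]

# LabelEncoder assigns integer codes in sorted (alphabetical) order.
encoder = LabelEncoder().fit(purposes)
codes = encoder.transform(purposes)

# Hypothetical threshold of the kind a tree could pick on the codes.
threshold = 4.5
left = [p for p, c in zip(purposes, codes) if c <= threshold]
right = [p for p, c in zip(purposes, codes) if c > threshold]

print(left)
print(right)
```

The two groups are determined purely by alphabetical order, so 'buy_radio_tv' and 'buy_used_car' end up on opposite sides of the split while 'business' sits with the first four 'buy_' categories.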


One hot encoding for categorical variables

pd.get_dummies(credit_scoring['purpose'], prefix='purpose').iloc[1]
purpose_business                   0
purpose_buy_domestic_appliance     0
purpose_buy_furniture_equipment    0
purpose_buy_new_car                0
purpose_buy_radio_tv               1
purpose_buy_used_car               0
purpose_education                  0
purpose_other                      0
purpose_repairs                    0
purpose_retraining                 0
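A small self-contained sketch of the same idea, using a hypothetical four-row Series rather than the real credit_scoring data, shows that each row gets exactly one active indicator column:

```python
import pandas as pd

# Hypothetical toy rows; the real data has ten distinct purposes.
purpose = pd.Series(
    ['buy_radio_tv', 'education', 'buy_new_car', 'buy_radio_tv'],
    name='purpose',
)

# One indicator column per distinct category, prefixed with the variable name.
one_hot = pd.get_dummies(purpose, prefix='purpose')

# Every row activates exactly one indicator.
row_sums = one_hot.sum(axis=1)

print(one_hot.columns.tolist())
print(row_sums.tolist())
```

Unlike label encoding, no ordering between categories is implied, at the cost of one column per category.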

Keyword encoding for categorical variables

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()

# Replace underscores with spaces so that each keyword becomes its own token
credit_scoring['purpose'] = credit_scoring['purpose'].apply(lambda s: ' '.join(s.split('_')))
dummy_matrix = vec.fit_transform(credit_scoring['purpose']).toarray()
pd.DataFrame(dummy_matrix, columns=vec.get_feature_names_out()).head()
      appliance  business  buy  car  ...   repairs  retraining  tv  used
0          0         0    1    0  ...         0           0   1     0
1          0         0    1    0  ...         0           0   1     0
2          0         0    0    0  ...         0           0   0     0
3          0         0    1    0  ...         0           0   0     0

Dimensionality and feature engineering

Categorical variables in credit:

  • Label encoding: 1 column.
  • One-hot encoding: 10 columns.
  • Keyword encoding: 15 columns.

ECG features in arrhythmias:

  • Over 250 features
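The column counts for the credit purpose variable can be checked with a short sketch. It assumes only the ten category values listed earlier, rebuilt here as a standalone Series rather than read from credit_scoring.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# The ten distinct purpose values, one per row.
purposes = pd.Series([
    'business', 'buy_domestic_appliance', 'buy_furniture_equipment',
    'buy_new_car', 'buy_radio_tv', 'buy_used_car', 'education',
    'other', 'repairs', 'retraining',
], name='purpose')

# Label encoding returns a single 1-D array of integer codes: one column.
label_encoded = LabelEncoder().fit_transform(purposes)
n_label = 1 if label_encoded.ndim == 1 else label_encoded.shape[1]

# One-hot encoding: one column per distinct category.
n_one_hot = pd.get_dummies(purposes).shape[1]

# Keyword encoding: one column per distinct word once underscores become spaces.
as_text = purposes.str.replace('_', ' ', regex=False)
n_keyword = CountVectorizer().fit_transform(as_text).shape[1]

print(n_label, n_one_hot, n_keyword)  # 1 10 15
```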

In-sample accuracy starts at 0.8 with 0 added columns and increases to almost 1.0 as more fake columns are added. Out-of-sample accuracy, by contrast, drops as soon as fake columns are added.
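An experiment of this kind can be sketched as follows. This is not the code behind the plot: the synthetic dataset from make_classification, the logistic regression model, the 50/50 split, and the fake-column counts are all assumptions made for illustration, so the exact numbers will differ from those described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a real dataset (assumption, not the credit data).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

results = {}
for n_fakes in (0, 50, 200):
    # Pure-noise columns that carry no information about y.
    fakes = rng.uniform(low=0.0, high=1.0, size=(X.shape[0], n_fakes))
    X_aug = np.hstack([X, fakes])

    X_train, X_test, y_train, y_test = train_test_split(
        X_aug, y, test_size=0.5, random_state=0
    )
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

    results[n_fakes] = {
        'in_sample': model.score(X_train, y_train),
        'out_of_sample': model.score(X_test, y_test),
    }

for n_fakes, scores in results.items():
    print(n_fakes, scores)
```

Comparing the in-sample and out-of-sample scores as n_fakes grows gives a rough picture of how extra uninformative columns let a model fit the training data better without generalising better.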


Feature selection

import pandas as pd
from numpy.random import uniform

# 100 columns of pure noise, one row for each row of X
fakes = pd.DataFrame(
    uniform(low=0.0, high=1.0, size=(X.shape[0], 100)),
    columns=['fake_' + str(j) for j in range(100)],
    index=X.index
)
X_with_fakes = pd.concat([X, fakes], axis=1)

Feature selection

from sklearn.feature_selection import chi2, SelectKBest
sk = SelectKBest(chi2, k=20)
which_selected = sk.fit(X_with_fakes, y).get_support()
X_with_fakes.columns[which_selected]
['checking_status', 'duration', 'credit_history', 'purpose',
    'credit_amount', 'savings_status', 'installment_commitment',
    'personal_status', 'property_magnitude', 'age', 'other_payment_plans',
    'job', 'own_telephone', 'fake_28', 'fake_46', 'fake_51', 'fake_60',
    'fake_83', 'fake_95', 'fake_99']
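A quick tally of the selected names, copied from the output above, shows how many noise columns survived the selection:

```python
# Column names copied from the SelectKBest output above.
selected = [
    'checking_status', 'duration', 'credit_history', 'purpose',
    'credit_amount', 'savings_status', 'installment_commitment',
    'personal_status', 'property_magnitude', 'age', 'other_payment_plans',
    'job', 'own_telephone', 'fake_28', 'fake_46', 'fake_51', 'fake_60',
    'fake_83', 'fake_95', 'fake_99',
]

# Count the columns that are pure noise by construction.
n_fake = sum(name.startswith('fake_') for name in selected)
n_real = len(selected) - n_fake

print(n_real, n_fake)  # 13 7
```

So 13 of the 20 selected features are real and 7 are fake: the selection removed most of the 100 noise columns, but not all of them.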

Tradeoffs everywhere!
