Feature engineering and overfitting

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

Feature extraction from non-tabular data

A single patient's ECG is typically printed on paper as a time series.

arrhythmias.head()
   age  sex  height  weight  ...    chV6_TwaveAmp  chV6_QRSA  chV6_QRSTA  class
0   75    0     190      80  ...              2.9       23.3        49.4      0
1   56    1     165      64  ...              2.1       20.4        38.8      0
2   54    0     172      95  ...              3.4       12.3        49.0      0
3   55    0     175      94  ...              2.6       34.6        61.6      1
4   75    0     190      80  ...              3.9       25.4        62.8      0
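
The step from a raw time series to one table row per patient can be sketched as follows. This is only an illustration under stated assumptions: the signal is synthetic, the sampling grid is made up, and the summary statistics are hypothetical stand-ins for the domain-specific ECG measurements (such as chV6_TwaveAmp) that appear in arrhythmias.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one ECG channel (NOT real patient data).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 2500)  # hypothetical 10-second recording
signal = np.sin(2 * np.pi * 1.2 * t) + 0.1 * rng.normal(size=t.size)


def summarize_channel(x, name):
    """Collapse one time series into a few named scalar features."""
    return {
        f"{name}_mean": float(np.mean(x)),
        f"{name}_std": float(np.std(x)),
        f"{name}_max_amp": float(np.max(x)),
        f"{name}_min_amp": float(np.min(x)),
    }


# One row per patient, one column per extracted feature.
features = pd.DataFrame([summarize_channel(signal, "chV6")])
print(features)
```

Repeating this for every channel and every measurement is how a single recording can end up as hundreds of columns.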

Label encoding for categorical variables

import numpy
from sklearn.preprocessing import LabelEncoder

numpy.unique(credit_scoring['purpose'])
array(['business', 'buy_domestic_appliance', 'buy_furniture_equipment',
       'buy_new_car', 'buy_radio_tv', 'buy_used_car', 'education',
       'other', 'repairs', 'retraining'], dtype=object)
numpy.unique(LabelEncoder().fit_transform(credit_scoring['purpose']))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Label encoding for categorical variables

A decision tree splits the credit purpose variable into arbitrary contiguous numeric intervals, even though the numbers are only codes for the underlying categories.
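
A minimal sketch of this effect, assuming a toy dataset rather than the credit data: the purpose values and the target below are hypothetical, chosen so that the target depends on the category but not on the alphabetical order that LabelEncoder imposes.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): the label depends on the category, not on any ordering.
purpose = np.array(['business', 'education', 'other', 'repairs', 'retraining'] * 20)
y = np.isin(purpose, ['education', 'repairs']).astype(int)

# LabelEncoder assigns codes in sorted order: business=0, ..., retraining=4.
codes = LabelEncoder().fit_transform(purpose)
X = codes.reshape(-1, 1)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=['purpose_code']))

# Split thresholds of the internal nodes (leaves carry a negative feature index).
thresholds = sorted(float(v) for v in tree.tree_.threshold[tree.tree_.feature >= 0])
print(thresholds)
```

The tree can only ask questions of the form purpose_code <= threshold, so it treats the integer codes as if they had a meaningful order.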


One-hot encoding for categorical variables

pd.get_dummies(credit_scoring['purpose'], prefix='purpose', dtype=int).iloc[1]
purpose_business                   0
purpose_buy_domestic_appliance     0
purpose_buy_furniture_equipment    0
purpose_buy_new_car                0
purpose_buy_radio_tv               1
purpose_buy_used_car               0
purpose_education                  0
purpose_other                      0
purpose_repairs                    0
purpose_retraining                 0

Keyword encoding for categorical variables

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()

# Replace underscores with spaces so that each keyword becomes its own token
credit_scoring['purpose'] = credit_scoring['purpose'].str.replace('_', ' ', regex=False)
dummy_matrix = vec.fit_transform(credit_scoring['purpose']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(dummy_matrix, columns=vec.get_feature_names_out()).head()
      appliance  business  buy  car  ...   repairs  retraining  tv  used
0          0         0    1    0  ...         0           0   1     0
1          0         0    1    0  ...         0           0   1     0
2          0         0    0    0  ...         0           0   0     0
3          0         0    1    0  ...         0           0   0     0

Dimensionality and feature engineering

The categorical variable in credit:

  • Label encoding: 1 column.
  • One-hot encoding: 10 columns.
  • Keyword encoding: 15 columns.

ECG features in arrhythmias:

  • More than 250 features
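
The column counts for the three encodings can be checked with a short sketch. It assumes a stand-in Series holding just the ten purpose values listed earlier, one row each, in place of the full credit data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Stand-in for credit_scoring['purpose']: one row per distinct value.
purpose = pd.Series([
    'business', 'buy_domestic_appliance', 'buy_furniture_equipment',
    'buy_new_car', 'buy_radio_tv', 'buy_used_car', 'education',
    'other', 'repairs', 'retraining',
])

label_codes = LabelEncoder().fit_transform(purpose)   # 1-D array of integers
onehot = pd.get_dummies(purpose)                      # one column per category
keywords = CountVectorizer().fit_transform(
    purpose.str.replace('_', ' ', regex=False)        # one column per distinct word
)

n_cols = {
    'label': 1 if label_codes.ndim == 1 else label_codes.shape[1],
    'one-hot': onehot.shape[1],
    'keyword': keywords.shape[1],
}
print(n_cols)
```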

In-sample accuracy is about 0.8 with 0 added columns, then rises to almost 1.0 as more fake columns are added. Out-of-sample accuracy, by contrast, drops as soon as fake columns are added.
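
An experiment of this kind can be sketched as follows. Everything here is an assumption for illustration: the data come from make_classification rather than the credit data, the classifier is a logistic regression, and the numbers of fake columns are arbitrary, so the accuracies will not match the figures quoted above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

results = []
for n_fakes in [0, 50, 100, 200]:
    # Fake columns are pure noise, unrelated to y.
    fakes = rng.uniform(0.0, 1.0, size=(X.shape[0], n_fakes))
    X_aug = np.hstack([X, fakes])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_aug, y, test_size=0.5, random_state=0)
    clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    # In-sample score uses the training rows, out-of-sample the held-out rows.
    results.append((n_fakes, clf.score(X_tr, y_tr), clf.score(X_te, y_te)))

for n_fakes, acc_in, acc_out in results:
    print(f"{n_fakes:>3} fake columns: in-sample {acc_in:.2f}, "
          f"out-of-sample {acc_out:.2f}")
```

Comparing the two scores side by side is what exposes overfitting: the in-sample score alone would suggest that the noise columns help.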


Feature selection

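from numpy.random import uniform
n = X.shape[0]  # number of rows in the real data
fakes = pd.DataFrame(
    uniform(low=0.0, high=1.0, size=n * 100).reshape(n, 100),
    columns=['fake_' + str(j) for j in range(100)],
    index=X.index  # align with X so that concat does not introduce NaNs
)
X_with_fakes = pd.concat([X, fakes], axis=1)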

Feature selection

from sklearn.feature_selection import chi2, SelectKBest
sk = SelectKBest(chi2, k=20)
which_selected = sk.fit(X_with_fakes, y).get_support()
X_with_fakes.columns[which_selected]
['checking_status', 'duration', 'credit_history', 'purpose',
    'credit_amount', 'savings_status', 'installment_commitment',
    'personal_status', 'property_magnitude', 'age', 'other_payment_plans',
    'job', 'own_telephone', 'fake_28', 'fake_46', 'fake_51', 'fake_60',
    'fake_83', 'fake_95', 'fake_99']
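
The effect of k can be reproduced with a self-contained sketch. It assumes synthetic data in place of the credit data: five hypothetical real_ columns that shift with the class label, plus 100 uniform noise columns, all non-negative as chi2 requires.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)

# Hypothetical informative features: non-negative and shifted by the label.
real = pd.DataFrame(
    rng.uniform(0.0, 1.0, size=(n, 5)) + 0.5 * y[:, None],
    columns=[f"real_{j}" for j in range(5)],
)
# Noise features with no relationship to y.
fakes = pd.DataFrame(
    rng.uniform(0.0, 1.0, size=(n, 100)),
    columns=[f"fake_{j}" for j in range(100)],
)
X_with_fakes = pd.concat([real, fakes], axis=1)

for k in (5, 20):
    mask = SelectKBest(chi2, k=k).fit(X_with_fakes, y).get_support()
    selected = X_with_fakes.columns[mask]
    n_fake = sum(name.startswith("fake_") for name in selected)
    print(f"k={k}: {len(selected)} selected, {n_fake} of them fake")
```

With only five informative columns available, asking for 20 forces the selector to keep at least 15 noise columns, which is one side of the trade-off in choosing k.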

There is always a trade-off!
