Classification: feature engineering

Practicing Machine Learning Interview Questions in Python

Lisa Stuart

Data Scientist

Feature engineering...why?

  • Extracts additional information from the data
  • Creates additional relevant features
  • One of the most effective ways to improve predictive models

Benefits of feature engineering

  • Increased predictive power of the learning algorithm
  • Makes your machine learning models perform even better!

Types of feature engineering

  • Indicator variables
  • Interaction features
  • Feature representation

Indicator variables

  • Threshold indicator
    • age: high school vs college
  • Multiple features
    • used as a flag
  • Special events
    • Black Friday
    • Christmas
  • Groups of classes
    • website traffic paid flag
      • Google AdWords
      • Facebook ads
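
The four kinds of indicator variable listed above can be sketched with pandas. This is a minimal illustration, not the course's own code: the DataFrame, its column names, the age cutoff of 18, and the "2 bedrooms and 2 bathrooms" flag are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy data; every column name and value is illustrative only.
df = pd.DataFrame({
    "age": [16, 19, 23, 17],
    "bedrooms": [2, 1, 2, 3],
    "bathrooms": [2, 1, 1, 2],
    "date": pd.to_datetime(
        ["2023-11-24", "2023-12-25", "2023-07-04", "2023-11-23"]
    ),
    "traffic_source": ["Google AdWords", "Organic", "Facebook ads", "Email"],
})

# Threshold indicator: 1 when age is at or above an assumed cutoff of 18.
df["college_age"] = (df["age"] >= 18).astype(int)

# Multiple features combined into a single flag (assumed rule).
df["two_bed_two_bath"] = (
    (df["bedrooms"] == 2) & (df["bathrooms"] == 2)
).astype(int)

# Special events: flag rows that fall on Black Friday or Christmas 2023.
special_days = pd.to_datetime(["2023-11-24", "2023-12-25"])
df["special_event"] = df["date"].isin(special_days).astype(int)

# Groups of classes: collapse several paid sources into one "paid" flag.
paid_sources = ["Google AdWords", "Facebook ads"]
df["paid_traffic"] = df["traffic_source"].isin(paid_sources).astype(int)
```

Each new column is a 0/1 flag, so it can be passed to a model directly alongside the original features.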

Interaction features

  • Sum
  • Difference
  • Product
  • Quotient
  • Other mathematical combos
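
As a rough sketch, each interaction type above is a single arithmetic expression between two columns. The column names `x1` and `x2` and their values are placeholders chosen for the example.

```python
import pandas as pd

# Hypothetical numeric features; names and values are placeholders.
df = pd.DataFrame({"x1": [10.0, 20.0, 30.0], "x2": [2.0, 5.0, 10.0]})

df["x1_plus_x2"] = df["x1"] + df["x2"]    # sum
df["x1_minus_x2"] = df["x1"] - df["x2"]   # difference
df["x1_times_x2"] = df["x1"] * df["x2"]   # product
df["x1_over_x2"] = df["x1"] / df["x2"]    # quotient (x2 is non-zero here)
```

With real data the quotient needs a guard against zero denominators, for example by replacing zeros with missing values before dividing.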

Feature representation

  • Datetime stamps
    • Day of week
    • Hour of day
  • Grouping categorical levels into 'Other'
  • Transform categorical to dummy variables
    • (k - 1) binary columns
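
The representation changes above might look like this in pandas. The data, the column names, and the rule "a level seen fewer than 2 times is rare" are assumptions for illustration. Note that `pd.get_dummies` returns k columns by default and only yields k - 1 when `drop_first=True` is passed.

```python
import pandas as pd

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:30", "2024-01-06 18:45",
        "2024-01-07 23:10", "2024-01-08 07:00",
    ]),
    "color": ["red", "blue", "red", "teal"],
})

# Datetime stamps: extract day of week (Monday=0) and hour of day.
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["hour_of_day"] = df["timestamp"].dt.hour

# Group sparse categorical levels into 'Other' (assumed cutoff: count < 2).
counts = df["color"].value_counts()
rare_levels = counts[counts < 2].index
df["color_grouped"] = df["color"].where(~df["color"].isin(rare_levels), "Other")

# Dummy variables: drop_first=True keeps k - 1 columns for k levels.
dummies = pd.get_dummies(df["color_grouped"], prefix="color", drop_first=True)
```

Here `color_grouped` has two levels ('Other' and 'red'), so the encoding produces a single column.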

Different categorical levels

  • Training data:
    • model trained with [red, blue, green]
  • Test data:
    • model tested with [red, green, yellow]
    • one additional color (yellow) not seen in training
    • one training color (blue) missing
  • Robust one-hot encoding

1 https://blog.cambridgespark.com/robust-one-hot-encoding-in-python-3e29bfcec77e
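
One common way to handle mismatched levels, shown here as a sketch and not necessarily the approach taken in the linked post, is to encode the training data first and then align the test columns to the training columns. The `color` column name is an assumption.

```python
import pandas as pd

# Levels taken from the slide; the column name "color" is an assumption.
train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["red", "green", "yellow"]})

train_encoded = pd.get_dummies(train["color"], prefix="color")
test_encoded = pd.get_dummies(test["color"], prefix="color")

# Align test columns to the training columns:
#   - levels missing from test (blue) are added and filled with 0
#   - levels unseen in training (yellow) are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
```

After alignment the test matrix has exactly the columns the model was trained on, and the unseen 'yellow' row is encoded as all zeros.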

Debt to income ratio

  • Monthly Debt
  • Annual Income/12
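
Reading the two bullets as numerator and denominator, the ratio is monthly debt divided by monthly income. A minimal sketch, using the slide's column names with made-up values:

```python
import pandas as pd

# Column names follow the slide; the values are made up for illustration.
df = pd.DataFrame({
    "Monthly Debt": [500.0, 1200.0, 0.0],
    "Annual Income": [60000.0, 48000.0, 36000.0],
})

# Debt to income = Monthly Debt / (Annual Income / 12)
df["Debt to Income"] = df["Monthly Debt"] / (df["Annual Income"] / 12)
```

This is a quotient interaction feature: it combines two raw columns into one value that is comparable across borrowers with different incomes.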

Feature engineering functions

Function                                                        Returns
sklearn.linear_model.LogisticRegression                         logistic regression
sklearn.model_selection.train_test_split                        train/test split function
sns.countplot(x='Loan Status', data=data)                       bar plot
df.drop(['Feature 1', 'Feature 2'], axis=1)                     drops list of features
df["Loan Status"].replace({'Paid': 0, 'Not Paid': 1})           Loan Status as integers
pd.get_dummies(df, drop_first=True)                             k - 1 binary features
sklearn.metrics.accuracy_score(y_test, model.predict(X_test))   model accuracy
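
To show how these functions fit together, here is a hedged end-to-end sketch on a tiny synthetic loan table. The data, the `Term` and `Loan ID` columns, and the split and model settings are assumptions chosen only so the example runs; the plotting call is left out so that nothing depends on a display.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy loan data; values are synthetic.
data = pd.DataFrame({
    "Loan ID": range(20),
    "Loan Status": ["Paid", "Not Paid"] * 10,
    "Monthly Debt": [300.0, 1500.0] * 10,
    "Annual Income": [60000.0, 30000.0] * 10,
    "Term": ["Short", "Long"] * 10,
})

# Engineer a feature, then drop columns that are no longer needed.
data["Debt to Income"] = data["Monthly Debt"] / (data["Annual Income"] / 12)
data = data.drop(["Loan ID", "Monthly Debt"], axis=1)

# Encode the target as integers.
data["Loan Status"] = (
    data["Loan Status"].replace({"Paid": 0, "Not Paid": 1}).astype(int)
)

# One-hot encode categorical predictors into k - 1 columns each.
X = pd.get_dummies(data.drop("Loan Status", axis=1), drop_first=True)
y = data["Loan Status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
```

The accuracy on data this small and this cleanly separated says nothing about real performance. The point is only the order of the steps: engineer, drop, encode, split, fit, score.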

An excellent tutorial:


Let's practice!
