Data preparation

End-to-End Machine Learning

Joshua Stapleton

Machine Learning Engineer

Data preparation steps

A raw dataset may contain:

  • Missing values
  • Outliers
  • Imbalances
  • Empty columns
  • Duplicates

Data preparation:

  • Based on insights from EDA
  • Critical for model performance downstream
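
A quick way to check for each of the issues listed above is sketched below. The toy DataFrame and its values are hypothetical stand-ins for the course's heart disease data, and the 1.5 × IQR rule is just one common outlier heuristic, not necessarily the one used in the course.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the heart disease dataset
df = pd.DataFrame({
    "age": [63, 37, 41, 41, 250],               # 250 is an implausible value
    "chol": [233.0, np.nan, 204.0, 204.0, 236.0],
    "oldpeak": [np.nan] * 5,                    # completely empty column
    "target": [1, 1, 0, 0, 1],
})

missing_per_column = df.isnull().sum()            # missing values per column
empty_columns = df.columns[df.isnull().all()]     # columns with no data at all
n_duplicates = df.duplicated().sum()              # fully repeated rows
class_balance = df["target"].value_counts(normalize=True)  # class proportions

# Simple IQR rule to flag outliers in one numeric column
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(missing_per_column, list(empty_columns), n_duplicates, sep="\n")
print(class_balance)
print(outliers)
```

Each result maps to one bullet: missing values, empty columns, duplicates, imbalances, and outliers.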

Null / empty values

  • Drop missing or sparse rows/columns
  • Null values can break a model
  • Use df.drop() for columns
  • Use df.dropna(how='all') for rows
# Count missing values in a column
print(heart_disease_df['oldpeak'].isnull().sum())

# Drop empty column(s), then rows in which every value is missing
columns_dropped = heart_disease_df.drop(['oldpeak'], axis='columns')
rows_and_columns_dropped = columns_dropped.dropna(how='all')

Dealing with null / empty values

  • Data cleaning / dropping values depends on EDA findings

  • If a given column has too many missing values:
    • Drop the column

  • If the target column has missing values:
    • Drop rows with missing targets
    • Or treat as separate category
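
The two rules above could be sketched as follows. The toy data and the 50% cutoff are illustrative assumptions; what counts as "too many" missing values depends on the EDA findings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the heart disease dataset
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "oldpeak": [np.nan, np.nan, np.nan, 0.8],   # 75% missing
    "target": [1.0, np.nan, 0.0, 1.0],          # one missing target
})

max_missing_fraction = 0.5  # assumed cutoff, choose based on EDA

# Fraction of missing values in each column
missing_fraction = df.isnull().mean()

# Drop columns whose missing fraction exceeds the cutoff
sparse_columns = missing_fraction[missing_fraction > max_missing_fraction].index
df_reduced = df.drop(columns=sparse_columns)

# Drop rows whose target value is missing
df_clean = df_reduced.dropna(subset=["target"])
print(df_clean)
```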

Imputation

What to do when there are only a few missing values?

  • Imputation:

    • Fill missing values with substitutes
  • Strategies

    • Fill with mean or median
    • Use constant or previous value
# Calculate the mean cholesterol value
mean_value = heart_disease_df['chol'].mean()

# Fill missing cholesterol values with the mean
# (assign back rather than calling fillna(inplace=True) on a column selection)
heart_disease_df['chol'] = heart_disease_df['chol'].fillna(mean_value)
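
The other strategies listed above (median, constant, previous value) follow the same pattern. A minimal sketch on a hypothetical toy series; the constant 0 is an arbitrary placeholder, not a recommended value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series standing in for a column such as 'chol'
chol = pd.Series([200.0, np.nan, 240.0, np.nan, 310.0])

median_filled = chol.fillna(chol.median())   # median is less sensitive to outliers
constant_filled = chol.fillna(0)             # constant value (arbitrary choice here)
previous_filled = chol.ffill()               # carry the previous value forward

print(median_filled.tolist())
print(constant_filled.tolist())
print(previous_filled.tolist())
```

Forward fill only makes sense when row order is meaningful, for example in time-ordered data.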

Advanced imputation

Advanced techniques:

  • K-nearest neighbors
  • SMOTE (synthetic minority oversampling technique), which addresses class imbalance rather than missing values
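from sklearn.impute import KNNImputer

# Initialize KNNImputer
imputer = KNNImputer(n_neighbors=2, weights="uniform")

# KNNImputer expects 2D input, so pass a DataFrame rather than a single Series.
# Neighbors are found using the other numeric columns.
# Assumes entirely empty columns have already been dropped.
numeric_cols = df.select_dtypes(include='number').columns
df_imputed = df.copy()
df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])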
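
To see what KNN imputation does, here is a small worked sketch on hypothetical data. The missing 'oldpeak' value is filled with the average of that column over the two rows closest in 'age'. Features are left unscaled here for simplicity; in practice distances depend on feature scale.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with one missing 'oldpeak' value
df = pd.DataFrame({
    "age": [40.0, 41.0, 60.0, 61.0],
    "oldpeak": [1.0, np.nan, 3.0, 3.2],
})

imputer = KNNImputer(n_neighbors=2, weights="uniform")

# fit_transform returns a NumPy array, so rebuild the DataFrame
df_imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

# The nearest rows by age (40 and 60) have oldpeak 1.0 and 3.0
print(df_imputed)
```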

Dropping duplicates

  • Data must be clean, concise, and rich
  • Redundancies are unhelpful
  • Duplicates can bias or confuse a model
  • Use unique identifiers as the criterion for dropping duplicate records / rows
# Drop duplicate rows
heart_disease_duplicates_dropped = heart_disease_column_dropped.drop_duplicates()
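
When a unique identifier is available, duplicates can be dropped on that column alone. A minimal sketch; 'patient_id' is a hypothetical identifier column that may not exist in the course dataset:

```python
import pandas as pd

# Hypothetical records; patient 102 appears twice with slightly different values
records = pd.DataFrame({
    "patient_id": [101, 102, 102, 103],
    "age": [63, 37, 37, 41],
    "chol": [233, 250, 255, 204],
})

# Only rows identical in every column count as duplicates here
exact_duplicates_dropped = records.drop_duplicates()

# Rows sharing an identifier count as duplicates; keep the first occurrence
unique_patients = records.drop_duplicates(subset=["patient_id"], keep="first")

print(len(exact_duplicates_dropped), len(unique_patients))
```

Without `subset`, the two rows for patient 102 are both kept because their 'chol' values differ.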

Let's practice!
