Model training with GitHub Actions

CI/CD for Machine Learning

Ravi Bhadauria

Machine Learning Engineer

Dataset: Weather Prediction in Australia

Binary classification
- Predicts rainfall for tomorrow
5 Categorical Features
- Location
- WindGustDir
- WindDir9am
- WindDir3pm
- RainToday

17 Numerical Features
- MinTemp
- MaxTemp
- Rainfall
- Evaporation
- ...
- WindGustSpeed
- Cloud3pm
- Temp9am
- RISK_MM

¹ https://www.kaggle.com/datasets/rever3nd/weather-data

Modeling workflow

Data preprocessing
- Convert categorical features to numerical
- Replace missing values of features
- Scale features
Random Forest Classifier
- max_depth = 2, n_estimators = 50
Standard metrics on test data
- Performance plots
  - Confusion matrix plot

Data preparation: target encoding

from typing import List

import pandas as pd


def target_encode_categorical_features(
    df: pd.DataFrame, categorical_columns: List[str], target_column: str
) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is left unchanged
    encoded_data = df.copy()

    # Iterate through categorical columns
    for col in categorical_columns:
        # Calculate mean target value for each category
        # (the target column must already be numeric, e.g. 0/1)
        encoding_map = df.groupby(col)[target_column].mean().to_dict()

        # Apply target encoding
        encoded_data[col] = encoded_data[col].map(encoding_map)

    return encoded_data

¹ https://maxhalford.github.io/blog/target-encoding/
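To make the encoding concrete, here is a minimal sketch that applies the same groupby-and-map logic to one column of a toy frame. The column name `RainTomorrow` and the 0/1 target values are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy data (made up): the last row has a missing Location
df = pd.DataFrame(
    {
        "Location": ["Sydney", "Sydney", "Perth", "Perth", None],
        "RainTomorrow": [1, 0, 0, 0, 1],  # assumed numeric 0/1 target
    }
)

# Mean target per category; groupby skips missing keys by default
encoding_map = df.groupby("Location")["RainTomorrow"].mean().to_dict()

# Replace each category by its mean target value
encoded = df.copy()
encoded["Location"] = encoded["Location"].map(encoding_map)

print(encoding_map)
print(encoded)
```

Sydney is encoded as 0.5 and Perth as 0.0. The row with a missing Location stays missing after the mapping, so it is left for the imputation step to fill.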

Imputing and Scaling

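import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def impute_and_scale_data(df_features: pd.DataFrame) -> pd.DataFrame:
    # Replace missing values with the column mean
    imputer = SimpleImputer(strategy="mean")
    X_preprocessed = imputer.fit_transform(df_features.values)

    # Scale each feature to zero mean and unit variance
    scaler = StandardScaler()
    X_preprocessed = scaler.fit_transform(X_preprocessed)

    # Keep the original column names and row index
    return pd.DataFrame(
        X_preprocessed, columns=df_features.columns, index=df_features.index
    )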
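As a quick check of what imputing and scaling do, the sketch below runs the same two transformers on a tiny frame with one missing value. The numbers are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy features (made up): MinTemp has one missing value
features = pd.DataFrame(
    {"MinTemp": [10.0, np.nan, 14.0], "MaxTemp": [20.0, 25.0, 30.0]}
)

# Fill the gap with the column mean: (10 + 14) / 2 = 12
imputed = SimpleImputer(strategy="mean").fit_transform(features.values)

# Rescale every column to zero mean and unit variance
scaled = StandardScaler().fit_transform(imputed)
result = pd.DataFrame(scaled, columns=features.columns)

print(result)
```

After the transformation no values are missing, and each column has mean 0 and (population) standard deviation 1.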

Training

Train/Test split

from sklearn.model_selection import train_test_split

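# Drop the target from the features; columns= is required,
# because drop() removes rows by default
X_train, X_test, y_train, y_test = train_test_split(
  data.drop(columns=TARGET_COLUMN), data[TARGET_COLUMN], random_state=1993)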
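A self-contained sketch of the split on synthetic data follows. The column names and the value of `TARGET_COLUMN` are assumptions, and the random values stand in for the preprocessed weather data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

TARGET_COLUMN = "RainTomorrow"  # assumed name of the target column

# Synthetic stand-in for the preprocessed dataset (100 rows)
rng = np.random.default_rng(1993)
data = pd.DataFrame(
    {
        "MinTemp": rng.normal(12, 5, 100),
        "Rainfall": rng.exponential(2, 100),
        TARGET_COLUMN: rng.integers(0, 2, 100),
    }
)

# Features are every column except the target
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=TARGET_COLUMN), data[TARGET_COLUMN], random_state=1993
)

print(X_train.shape, X_test.shape)
```

With the default settings, 25% of the rows are held out for testing, so 100 rows split into 75 for training and 25 for testing. Fixing `random_state` makes the split reproducible across CI runs.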

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

# Named `model` to match the metrics and plotting code
model = RandomForestClassifier(
  max_depth=2, n_estimators=50, random_state=1993)
model.fit(X_train, y_train)
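To see what the two hyperparameters control, this sketch fits the same configuration on synthetic data from `make_classification` (an assumption, used here in place of the weather data) and inspects the fitted trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data standing in for the real features
X, y = make_classification(n_samples=200, n_features=5, random_state=1993)

model = RandomForestClassifier(max_depth=2, n_estimators=50, random_state=1993)
model.fit(X, y)

# n_estimators sets the number of trees; max_depth caps each tree's depth
n_trees = len(model.estimators_)
deepest = max(tree.get_depth() for tree in model.estimators_)

print(n_trees, deepest)
```

Shallow trees like these train quickly, which keeps a CI run short.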

Metrics

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Calculate predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Calculate precision
precision = precision_score(y_test, y_pred)

# Calculate recall
recall = recall_score(y_test, y_pred)

# Calculate f1 score
f1 = f1_score(y_test, y_pred)

¹ https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics
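A small worked example shows how the four metrics differ. The labels below are made up, with 1 meaning rain and 0 meaning no rain.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 4 rainy days, 6 dry days
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Predictions: 2 true positives, 2 false negatives, 1 false positive
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_hat)    # (2 + 5) / 10
precision = precision_score(y_true, y_hat)  # 2 / (2 + 1)
recall = recall_score(y_true, y_hat)        # 2 / (2 + 2)
f1 = f1_score(y_true, y_hat)                # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

Accuracy is 0.7 even though only half of the rainy days are caught (recall 0.5), which is why precision, recall, and F1 are reported alongside it.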

Plots

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap=plt.cm.Blues)

Plot of confusion matrix computed on test set

GitHub Actions Workflow

Plot of model training continuous integration workflow

Continuous Machine Learning (CML)
- CI/CD tool for Machine Learning
- GitHub Actions Integration
  - Provision training machines
  - Perform training and evaluation
  - Compare experiments
  - Monitor datasets
  - Visual reports

¹ https://cml.dev/
² https://martinfowler.com/bliki/FeatureBranch.html

CML commands

# Enable setup-cml action to be used later
- uses: iterative/setup-cml@v1

- name: Train model
  run: |
    # Your ML workflow goes here
    pip install -r requirements.txt
    python3 train.py

¹ https://www.markdownguide.org/basic-syntax/#images

CML commands

- name: Write CML report
  run: |
    # Add results and plots to markdown
    cat results.txt >> report.md
    echo "![training graph](./graph.png)" >> report.md

    # Create comment from markdown report
    cml comment create report.md

  env:
    REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
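The report step above reads `results.txt`, so the training script has to write it. A minimal sketch of that part, assuming a hypothetical helper `write_results` and placeholder metric values (not results from the course model):

```python
from pathlib import Path
from typing import Dict


def write_results(metrics: Dict[str, float], path: str = "results.txt") -> None:
    """Write one markdown bullet per metric (hypothetical helper)."""
    lines = [f"- {name}: {value:.2f}" for name, value in metrics.items()]
    Path(path).write_text("\n".join(lines) + "\n")


# Placeholder values for illustration only
write_results({"accuracy": 0.70, "precision": 0.67, "recall": 0.50, "f1": 0.57})

print(Path("results.txt").read_text())
```

Writing the metrics as markdown bullets means they render as a list once the file is appended to `report.md`.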

Output

Screenshot of pull request page displaying a comment generated by CML

Let's practice!
