Model training with GitHub Actions

CI/CD for Machine Learning

Ravi Bhadauria

Machine Learning Engineer

Dataset: Weather Prediction in Australia

  • Binary classification
    • Predicts rainfall for tomorrow
  • 5 Categorical Features
    • Location
    • WindGustDir
    • WindDir9am
    • WindDir3pm
    • RainToday
  • 17 Numerical Features
    • MinTemp
    • MaxTemp
    • Rainfall
    • Evaporation
    • ...
    • WindGustSpeed
    • Cloud3pm
    • Temp9am
    • RISK_MM
1 https://www.kaggle.com/datasets/rever3nd/weather-data

Modeling workflow

  • Data preprocessing
    • Convert categorical features to numerical
    • Replace missing values of features
    • Scale features
  • Random Forest Classifier
    • max_depth = 2, n_estimators = 50
  • Standard metrics on test data
    • Performance plots
      • Confusion matrix plot

Data preparation: target encoding

from typing import List

import pandas as pd


def target_encode_categorical_features(
    df: pd.DataFrame, categorical_columns: List[str], target_column: str
) -> pd.DataFrame:
    # Work on a copy so the input DataFrame is left unchanged
    encoded_data = df.copy()

    # Iterate through categorical columns
    for col in categorical_columns:
        # Calculate mean target value for each category
        encoding_map = df.groupby(col)[target_column].mean().to_dict()

        # Replace each category with its mean target value
        encoded_data[col] = encoded_data[col].map(encoding_map)

    return encoded_data
1 https://maxhalford.github.io/blog/target-encoding/
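To make the encoding concrete, here is a minimal sketch that applies the function to a tiny hand-made table. The toy rows and the target column name RainTomorrow are assumptions for illustration, not values taken from the dataset. Note that the map is computed from whatever data is passed in, so in practice it is usually safer to compute it on the training split only.

```python
from typing import List

import pandas as pd


def target_encode_categorical_features(
    df: pd.DataFrame, categorical_columns: List[str], target_column: str
) -> pd.DataFrame:
    encoded_data = df.copy()
    for col in categorical_columns:
        # Mean of the target for each category value
        encoding_map = df.groupby(col)[target_column].mean().to_dict()
        encoded_data[col] = encoded_data[col].map(encoding_map)
    return encoded_data


# Hypothetical toy data: two locations and a binary target
toy = pd.DataFrame(
    {
        "Location": ["Sydney", "Sydney", "Perth", "Perth", "Perth"],
        "RainTomorrow": [1, 0, 0, 0, 1],  # assumed target column name
    }
)

encoded = target_encode_categorical_features(toy, ["Location"], "RainTomorrow")

# Each location is replaced by the share of rainy days observed for it
print(encoded)
```

Sydney has one rainy day out of two and Perth one out of three, so the Location column becomes the corresponding fractions.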

Imputing and Scaling

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def impute_and_scale_data(df_features: pd.DataFrame) -> pd.DataFrame:
    # Replace missing values with the column mean
    imputer = SimpleImputer(strategy="mean")
    X_preprocessed = imputer.fit_transform(df_features.values)

    # Standardize each column to zero mean and unit variance
    scaler = StandardScaler()
    X_preprocessed = scaler.fit_transform(X_preprocessed)

    # Restore the original column names
    return pd.DataFrame(X_preprocessed, columns=df_features.columns)
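As a quick check of what this step does, the sketch below runs the function on a small table with one missing value. The numbers are made up for illustration; only the column names come from the feature list above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def impute_and_scale_data(df_features: pd.DataFrame) -> pd.DataFrame:
    # Replace missing values with the column mean
    imputer = SimpleImputer(strategy="mean")
    X_preprocessed = imputer.fit_transform(df_features.values)

    # Standardize each column to zero mean and unit variance
    scaler = StandardScaler()
    X_preprocessed = scaler.fit_transform(X_preprocessed)

    return pd.DataFrame(X_preprocessed, columns=df_features.columns)


# Hypothetical values with one missing MinTemp entry
toy = pd.DataFrame(
    {
        "MinTemp": [10.0, np.nan, 14.0],
        "MaxTemp": [20.0, 25.0, 30.0],
    }
)

scaled = impute_and_scale_data(toy)

# The missing value is filled with the column mean (12.0),
# which maps to 0 after standardization
print(scaled)
```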

Training

  • Train/Test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
  data.drop(columns=TARGET_COLUMN), data[TARGET_COLUMN], random_state=1993)
  • Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
  max_depth=2, n_estimators=50, random_state=1993)
clf.fit(X_train, y_train)
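The two steps above can be run end to end on stand-in data. This is a sketch only: the synthetic table from make_classification, the feature names, and the value of TARGET_COLUMN are assumptions, since the slides do not show how data is loaded.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

TARGET_COLUMN = "RainTomorrow"  # assumed name of the target column

# Synthetic stand-in for the preprocessed weather data
X, y = make_classification(n_samples=200, n_features=5, random_state=1993)
data = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
data[TARGET_COLUMN] = y

# Hold out 25% of the rows (the train_test_split default) for testing
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=TARGET_COLUMN), data[TARGET_COLUMN], random_state=1993
)

# Shallow forest with the hyperparameters from the slides
clf = RandomForestClassifier(max_depth=2, n_estimators=50, random_state=1993)
clf.fit(X_train, y_train)

print(X_train.shape, X_test.shape)
```

Fixing random_state in both the split and the classifier keeps repeated CI runs comparable.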

Metrics

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Calculate predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Calculate precision
precision = precision_score(y_test, y_pred)
# Calculate recall
recall = recall_score(y_test, y_pred)
# Calculate f1 score
f1 = f1_score(y_test, y_pred)
1 https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics
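Since the CML report step later reads a results.txt file, one plausible way to produce it is to write the metrics out at the end of training. The sketch below assumes that file layout and uses hand-made labels in place of real predictions; neither is shown in the slides.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels standing in for y_test and the model predictions
y_test = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}

# Write one "name: value" line per metric (assumed report format)
with open("results.txt", "w") as f:
    for name, value in metrics.items():
        f.write(f"{name}: {value:.3f}\n")
```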

Plots

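import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix of the fitted classifier on the test set
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, cmap=plt.cm.Blues)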

Plot of confusion matrix computed on test set


GitHub Actions Workflow

Plot of model training continuous integration workflow

  • Continuous Machine Learning (CML)
    • CI/CD tool for Machine Learning
    • GitHub Actions Integration
      • Provision training machines
      • Perform training and evaluation
      • Compare experiments
      • Monitor datasets
      • Visual reports
1 https://cml.dev/
2 https://martinfowler.com/bliki/FeatureBranch.html

CML commands

# Enable setup-cml action to be used later
- uses: iterative/setup-cml@v1
- name: Train model
  run: |
    # Your ML workflow goes here
    pip install -r requirements.txt
    python3 train.py
1 https://www.markdownguide.org/basic-syntax/#images

CML commands

- name: Write CML report
  run: |
    # Add results and plots to markdown
    cat results.txt >> report.md
    echo "![training graph](./graph.png)" >> report.md

    # Create comment from markdown report
    cml comment create report.md
  env:
    REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Output

Screenshot of pull request page displaying a comment generated by CML


Let's practice!
