Slides 🖼️

Learning objectives

  • Understand the core philosophy behind MLOps
  • Apply best practices for ML project structure and dependency management
  • Manage project dependencies with pip and venv
  • Version datasets with DVC

Project Introduction

Problem Description and Dataset

The dataset contains 10,000 records, each corresponding to a different customer of a bank. The target is Exited, a binary variable that indicates whether the customer left the bank. The remaining columns are row and customer identifiers, four columns of personal information (surname, location, gender, and age), and columns describing the customer's relationship with the bank (such as credit score, current account balance, and whether the customer is an active member, among others).

Dataset source: https://www.kaggle.com/datasets/filippoo/deep-learning-az-ann

Use Case

The objective is to train an ML model that returns the probability of a customer churning. This is a binary classification task, so the F1-score is a reasonable metric for evaluating the model: it is the harmonic mean of precision and recall, weights the two equally, and is high only when both are high.
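
The equal weighting of precision and recall follows from the definition of F1 as their harmonic mean. As a worked illustration with hypothetical numbers (not results from this project), write P for precision, R for recall, and TP, FP, FN for the counts of true positives, false positives, and false negatives:

```latex
\[
F_1 = \frac{2PR}{P + R}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}
\]
% Hypothetical values: P = 0.80, R = 0.60
\[
F_1 = \frac{2 \cdot 0.80 \cdot 0.60}{0.80 + 0.60} = \frac{0.96}{1.40} \approx 0.686
\]
% Hypothetical values: P = 0.90, R = 0.10
\[
F_1 = \frac{2 \cdot 0.90 \cdot 0.10}{0.90 + 0.10} = \frac{0.18}{1.00} = 0.18
\]
```

In the second case the arithmetic mean would be 0.50, so the harmonic mean penalizes a model that gains one quantity at the expense of the other.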

Project repository: https://github.com/alex000kim/open-source-mlops-e2e

Setup

  • Fork repository

If running locally, proceed to the next section.

If using gitpod.io:

  • Log into gitpod.io
  • Create a gitpod workspace using the forked repository

Steps

Manage project dependencies with pip and venv

(The steps below are not needed when running on gitpod.io because this setup is defined in .gitpod.yml)

  • Create dev branch
    git branch dev
    git checkout dev
    
  • Create a virtual environment
    python -m venv .venv
    
  • Install dependencies from requirements.txt
    source .venv/bin/activate
    pip install -U pip
    pip install -r requirements.txt
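
To confirm that the commands above ran inside the virtual environment, a small check like the following may help. It is only a sketch: it relies on the VIRTUAL_ENV variable that the activate script exports, and the function name check_venv is made up for this example.

```shell
# Hypothetical helper: report whether a virtual environment is active.
# The activate script exports VIRTUAL_ENV; it is unset or empty otherwise.
check_venv() {
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        echo "active: $VIRTUAL_ENV"
    else
        echo "inactive"
    fi
}

check_venv
```

Run right after source .venv/bin/activate, it should report the path of the .venv directory.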
    

Pin down dependencies in requirements.txt

Two options:

  1. pip freeze > requirements.txt
  2. Manually pin down versions of each installed library

Then commit and push the result:

    git add requirements.txt
    git commit -m "pin down library versions"
    git push --set-upstream origin dev
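
When versions are pinned manually, it is easy to miss a line. The following sketch lists requirement lines that lack an exact == pin. The function name unpinned is an assumption made for this example, and lines that pip freeze writes in other forms (for example editable installs) are reported as well.

```shell
# Hypothetical helper: print requirement lines without an exact '==' pin.
# Blank lines and comment lines are skipped.
unpinned() {
    grep -vE '^[[:space:]]*(#|$)' "$1" | grep -v '==' || true
}

# Run against the project file when it exists.
if [ -f requirements.txt ]; then
    unpinned requirements.txt
fi
```

An empty result means every requirement carries an exact version.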

Create a better project structure

  • Review: https://drivendata.github.io/cookiecutter-data-science/

  • Move the Churn_Modelling_*.csv files to the data/raw/ directory

  • Move TrainChurnModel.ipynb to the notebooks/ directory

  • Refactor TrainChurnModel.ipynb to use the new paths, then run it

    • Read data from data/raw/
    • Save cm.png to reports/figures
    • Save feat_imp.csv to reports (delete the file at the root of the repo)
    • Save clf-model.joblib to models (delete the file at the root of the repo)

    Final result should look like this: TrainChurnModel.ipynb

  • Push changes to git
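
The file moves in the steps above can be sketched as follows. This assumes that the commands run at the repository root and that the CSV files and the notebook currently sit there and are tracked by Git; the outputs (cm.png, feat_imp.csv, clf-model.joblib) are written to their new locations by the refactored notebook rather than moved by hand.

```shell
# Create the target directories; -p makes this safe to re-run.
mkdir -p data/raw notebooks reports/figures models

# Move tracked files with `git mv` so that Git records the move.
# Each move is skipped when the source file is not present.
for f in Churn_Modelling_*.csv; do
    if [ -e "$f" ]; then
        git mv "$f" data/raw/
    fi
done
if [ -e TrainChurnModel.ipynb ]; then
    git mv TrainChurnModel.ipynb notebooks/
fi
```

Git does not track empty directories, so reports/figures and models only show up in git status once the notebook has written files into them.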

Move data versioning from Git to DVC

  • Docs: https://dvc.org/doc/start/data-and-model-versioning

  • Initialize DVC project

    dvc init
    
  • Create AWS S3 bucket

  • Set AWS credentials: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html

  • Add DVC remote

    # https://dvc.org/doc/command-reference/remote/add#amazon-s3
    dvc remote add -d s3_remote s3://<YOUR_BUCKET_PATH>
    
  • Pass the AWS credentials to the DVC remote (--local stores them in .dvc/config.local, which is not committed to git)

    # https://dvc.org/doc/command-reference/remote/modify#amazon-s3
    dvc remote modify --local s3_remote access_key_id 'my_key_id'
    dvc remote modify --local s3_remote secret_access_key 'my_access_key'
    
  • Start versioning the data/raw directory and the models/clf-model.joblib file

    dvc add data/raw models/clf-model.joblib
    

    Note: You’ll see an error raised by DVC because an artifact (file or directory) can be versioned by either DVC or git, but not by both.

    data/raw and models/clf-model.joblib are already versioned with git, so we need to stop tracking them with git first:

    git rm -r --cached 'data/raw'
    git commit -m "stop tracking data/raw"
    git rm -r --cached 'models/clf-model.joblib'
    git commit -m "stop tracking models/clf-model.joblib"
    # re-run
    dvc add data/raw models/clf-model.joblib
    
  • Push versioned data to the DVC remote

    dvc push
    
  • Review git status and push changes to the git server

    git status
    git add .
    git status
    git commit -m 'initiate dvc and start tracking data and models'
    git push
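
Whether a path is handled by git or left to DVC can be checked at any point in the steps above. The sketch below assumes it runs at the repository root; the helper names tracked_by_git and ignored_by_git are made up for this example.

```shell
# Hypothetical helpers; both assume the current directory is the repository root.

# Succeeds when Git tracks the path or any file under it.
tracked_by_git() {
    git ls-files --error-unmatch -- "$1" >/dev/null 2>&1
}

# Succeeds when a .gitignore rule covers the path.
ignored_by_git() {
    git check-ignore -q -- "$1" 2>/dev/null
}

for p in data/raw models/clf-model.joblib; do
    if tracked_by_git "$p"; then
        echo "$p: still tracked by Git"
    elif ignored_by_git "$p"; then
        echo "$p: ignored by Git"
    else
        echo "$p: neither tracked nor ignored by Git"
    fi
done
```

Before the git rm --cached commands, both paths are expected to be reported as still tracked, which is the situation that makes dvc add fail. After dvc add succeeds, both should be reported as ignored, because dvc add writes a .gitignore entry next to each target, so git add . picks up only the small .dvc files.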