Slides 🖼️
Learning objectives
- Understand the core philosophy behind MLOps
- Apply best practices for establishing ML project structure and dependency management
- Manage project dependencies with pip and venv
- Version datasets with DVC
Project Introduction
Problem Description and Dataset
This dataset contains 10,000 records, each corresponding to a different customer of a bank. The target is `Exited`, a binary variable that indicates whether the customer decided to leave the bank. There are row and customer identifiers, four columns describing personal information about the customer (surname, location, gender, and age), and several other columns with information about the customer's relationship with the bank (such as credit score, current account balance, and whether they are an active member, among others).
Dataset source: https://www.kaggle.com/datasets/filippoo/deep-learning-az-ann
Use Case
The objective is to train an ML model that returns the probability of a customer churning. This is a binary classification task, so the F1-score is a good metric for evaluating model performance on this dataset: it weights recall and precision equally, and a good classifier will maximize both simultaneously, as the formula below shows.
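For reference, the F1-score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Because the harmonic mean is dominated by the smaller of the two values, a model can only achieve a high F1-score by doing well on both precision and recall.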
https://github.com/alex000kim/open-source-mlops-e2e
Setup
- Fork the repository
If running locally, proceed to the next section.
If using gitpod.io:
- Log into gitpod.io
- Create a gitpod workspace using the forked repository
Steps
Manage project dependencies with pip and venv
(The steps below are not needed when running on gitpod.io because this setup is already defined in `.gitpod.yml`.)
- Create a `dev` branch

```
git branch dev
git checkout dev
```
- Create a virtual environment

```
python -m venv .venv
```
- Install dependencies from `requirements.txt`

```
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
```
Pin down dependencies in `requirements.txt`. Two options:
- Run `pip freeze > requirements.txt` to capture the exact versions currently installed (sketched below)
- Manually pin the version of each installed library
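A minimal sketch of the first option (the package names and versions in the comment are illustrative, not the actual contents of this project's `requirements.txt`):

```
pip freeze > requirements.txt
# requirements.txt now lists exact versions, e.g.:
#   joblib==1.1.0
#   pandas==1.4.2
#   scikit-learn==1.0.2
```

Pinning exact versions makes the environment reproducible: anyone who later runs `pip install -r requirements.txt` gets the same library versions.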
```
git add requirements.txt
git commit -m "pin down library versions"
git push --set-upstream origin dev
```
Create a better project structure
Review: https://drivendata.github.io/cookiecutter-data-science/
- Move `Churn_Modelling_*.csv` files to the `data/raw/` directory (one way to do this from the shell is sketched after this list)
- Move `TrainChurnModel.ipynb` to the `notebooks/` directory and update it accordingly
- Refactor and run `TrainChurnModel.ipynb`:
  - Read data from `data/raw/`
  - Save `cm.png` to `reports/figures`
  - Save `feat_imp.csv` to `reports` (delete the file at the root of the repo)
  - Save `clf-model.joblib` to `models` (delete the file at the root of the repo)
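One way to carry out these moves from the shell, assuming the target directories don't exist yet:

```
# create the cookiecutter-style directory layout
mkdir -p data/raw notebooks reports/figures models
# move the raw data and the notebook into place
mv Churn_Modelling_*.csv data/raw/
mv TrainChurnModel.ipynb notebooks/
```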
The final result should look like this: TrainChurnModel.ipynb
Push changes to git
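For example (the commit message is illustrative):

```
git add .
git commit -m "restructure project into cookiecutter-style layout"
git push
```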
Further reading:
- https://realpython.com/python-application-layouts/
- https://drivendata.github.io/cookiecutter-data-science/
Move data versioning from Git to DVC
Initialize DVC project
```
dvc init
```
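`dvc init` creates a `.dvc/` directory (containing the DVC config) and a `.dvcignore` file, all of which are meant to be tracked by Git, so it's worth committing them right away (the commit message is illustrative):

```
git status                      # shows the files generated by dvc init
git add .dvc .dvcignore
git commit -m "initialize DVC"
```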
Create AWS S3 bucket
Set AWS credentials: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
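If the AWS CLI is configured, one way to create the bucket (the bucket name is a placeholder):

```
aws s3 mb s3://<YOUR_BUCKET_NAME>
```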
Add DVC remote
```
# https://dvc.org/doc/command-reference/remote/add#amazon-s3
dvc remote add -d s3_remote s3://<YOUR_BUCKET_PATH>
```
Set AWS credentials
```
# https://dvc.org/doc/command-reference/remote/modify#amazon-s3
dvc remote modify --local s3_remote access_key_id 'my_key_id'
dvc remote modify --local s3_remote secret_access_key 'my_access_key'
```

The `--local` flag writes these values to `.dvc/config.local`, which DVC keeps out of Git, so the secrets are never committed.
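An alternative to storing keys in the DVC config is to export the standard AWS environment variables, which DVC's S3 support (built on boto3) picks up through the default credential chain:

```
# standard AWS environment variables, read by boto3
export AWS_ACCESS_KEY_ID='my_key_id'
export AWS_SECRET_ACCESS_KEY='my_access_key'
```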
Start versioning the `data/raw` directory and the `models/clf-model.joblib` file

```
dvc add data/raw models/clf-model.joblib
```
Note: You'll see an error raised by DVC because an artifact (a file or directory) can be versioned by either DVC or Git, but not both. `data/raw` and `models/clf-model.joblib` are already versioned with Git, so we need to stop tracking them with Git first:

```
git rm -r --cached 'data/raw'
git commit -m "stop tracking data/raw"
git rm -r --cached 'models/clf-model.joblib'
git commit -m "stop tracking models/clf-model.joblib"
# re-run
dvc add data/raw models/clf-model.joblib
```
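Once `dvc add` succeeds, DVC writes small `.dvc` pointer files that Git tracks in place of the data itself. A `data/raw.dvc` file looks roughly like this (the hash, size, and file count are illustrative):

```
cat data/raw.dvc
# outs:
# - md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
#   size: 684858
#   nfiles: 2
#   path: raw
```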
Push versioned data to the DVC remote
```
dvc push
```
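To verify that the local cache and the remote are now in sync, compare them with the cloud flag of `dvc status`:

```
dvc status -c   # compares the local cache against the default remote
```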
Review `git status` and push changes to the git server

```
git status
git add .
git status
git commit -m 'initiate dvc and start tracking data and models'
git push
```