Slides 🖼️
Learning objectives
- Understand the core philosophy behind MLOps
- Apply best practices for establishing ML project structure and dependency management
- Manage project dependencies with pip and venv
- Version datasets with DVC
Project Introduction
Problem Description and Dataset
This dataset contains 10,000 records, each corresponding to a different customer of a bank. The target is `Exited`, a binary variable that indicates whether the customer decided to leave the bank. There are row and customer identifiers, four columns describing personal information about the customer (surname, location, gender, and age), and several other columns with information about the customer's relationship with the bank (such as credit score, current account balance, and whether they are an active member, among others).
Dataset source: https://www.kaggle.com/datasets/filippoo/deep-learning-az-ann
Use Case
The objective is to train an ML model that returns the probability of a customer churning. This is a binary classification task, so the F1-score is a good metric for evaluating model performance on this dataset: it weights recall and precision equally, and a good classifier will maximize both simultaneously, as the formula below shows.
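For reference, the F1-score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Because the harmonic mean is dominated by the smaller of the two values, a model can only achieve a high F1-score by doing well on both precision and recall.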
https://github.com/alex000kim/open-source-mlops-e2e
Setup
- Fork the repository
If running locally, proceed to the next section.
If using gitpod.io:
- Log into gitpod.io
- Create a gitpod workspace using the forked repository
Steps
Manage project dependencies with pip and venv
(The steps below are not needed when running on gitpod.io because this setup is already defined in `.gitpod.yml`.)
- Create a `dev` branch

```
git branch dev
git checkout dev
```
- Create a virtual environment

```
python -m venv .venv
```
- Install dependencies from `requirements.txt`

```
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
```
Pin down dependencies in `requirements.txt`. Two options:
- Run `pip freeze > requirements.txt` to capture the exact versions currently installed (sketched below)
- Manually pin the version of each installed library
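A minimal sketch of the first option (the package names and versions in the comment are illustrative, not the actual contents of this project's `requirements.txt`):

```
pip freeze > requirements.txt
# requirements.txt now lists exact versions, e.g.:
#   joblib==1.1.0
#   pandas==1.4.2
#   scikit-learn==1.0.2
```

Pinning exact versions makes the environment reproducible: anyone who later runs `pip install -r requirements.txt` gets the same library versions.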
```
git add requirements.txt
git commit -m "pin down library versions"
git push --set-upstream origin dev
```
Create a better project structure
Review: https://drivendata.github.io/cookiecutter-data-science/
- Move `Churn_Modelling_*.csv` files to the `data/raw/` directory (one way to do this from the shell is sketched after this list)
- Move `TrainChurnModel.ipynb` to the `notebooks/` directory and update it accordingly
- Refactor and run `TrainChurnModel.ipynb`:
  - Read data from `data/raw/`
  - Save `cm.png` to `reports/figures`
  - Save `feat_imp.csv` to `reports` (delete the file at the root of the repo)
  - Save `clf-model.joblib` to `models` (delete the file at the root of the repo)
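One way to carry out these moves from the shell, assuming the target directories don't exist yet:

```
# create the cookiecutter-style directory layout
mkdir -p data/raw notebooks reports/figures models
# move the raw data and the notebook into place
mv Churn_Modelling_*.csv data/raw/
mv TrainChurnModel.ipynb notebooks/
```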
The final result should look like this: TrainChurnModel.ipynb
Push changes to git
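For example (the commit message is illustrative):

```
git add .
git commit -m "restructure project into cookiecutter-style layout"
git push
```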
Further reading:
- https://realpython.com/python-application-layouts/
- https://drivendata.github.io/cookiecutter-data-science/
Move data versioning from Git to DVC
Initialize DVC project
```
dvc init
```
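`dvc init` creates a `.dvc/` directory (containing the DVC config) and a `.dvcignore` file, all of which are meant to be tracked by Git, so it's worth committing them right away (the commit message is illustrative):

```
git status                      # shows the files generated by dvc init
git add .dvc .dvcignore
git commit -m "initialize DVC"
```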
Create AWS S3 bucket
Set AWS credentials: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
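If the AWS CLI is configured, one way to create the bucket (the bucket name is a placeholder):

```
aws s3 mb s3://<YOUR_BUCKET_NAME>
```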
Add DVC remote
```
# https://dvc.org/doc/command-reference/remote/add#amazon-s3
dvc remote add -d s3_remote s3://<YOUR_BUCKET_PATH>
```
Set AWS credentials
```
# https://dvc.org/doc/command-reference/remote/modify#amazon-s3
dvc remote modify --local s3_remote access_key_id 'my_key_id'
dvc remote modify --local s3_remote secret_access_key 'my_access_key'
```

The `--local` flag writes these values to `.dvc/config.local`, which DVC keeps out of Git, so the secrets are never committed.
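An alternative to storing keys in the DVC config is to export the standard AWS environment variables, which DVC's S3 support (built on boto3) picks up through the default credential chain:

```
# standard AWS environment variables, read by boto3
export AWS_ACCESS_KEY_ID='my_key_id'
export AWS_SECRET_ACCESS_KEY='my_access_key'
```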
Start versioning the `data/raw` directory and the `models/clf-model.joblib` file

```
dvc add data/raw models/clf-model.joblib
```
Note: You'll see an error raised by DVC because an artifact (a file or directory) can be versioned by either DVC or Git, but not both. `data/raw` and `models/clf-model.joblib` are already versioned with Git, so we need to stop tracking them with Git first:

```
git rm -r --cached 'data/raw'
git commit -m "stop tracking data/raw"
git rm -r --cached 'models/clf-model.joblib'
git commit -m "stop tracking models/clf-model.joblib"
# re-run
dvc add data/raw models/clf-model.joblib
```
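Once `dvc add` succeeds, DVC writes small `.dvc` pointer files that Git tracks in place of the data itself. A `data/raw.dvc` file looks roughly like this (the hash, size, and file count are illustrative):

```
cat data/raw.dvc
# outs:
# - md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
#   size: 684858
#   nfiles: 2
#   path: raw
```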
Push versioned data to the DVC remote
```
dvc push
```
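To verify that the local cache and the remote are now in sync, compare them with the cloud flag of `dvc status`:

```
dvc status -c   # compares the local cache against the default remote
```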
Review `git status` and push changes to the git server

```
git status
git add .
git status
git commit -m 'initiate dvc and start tracking data and models'
git push
```