Intro to SLURM for ML Practitioners

SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager designed to schedule and manage jobs on large clusters. In the world of LLMs, SLURM has seen a resurgence in popularity due to the increased demand for training large models and scaling them across multiple nodes. This guide introduces the fundamental concepts of SLURM, covers common commands and script structures, and walks through advanced scenarios like distributed multi-node training. I’ll also share some useful tips and tricks. ...
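The post itself goes into much more detail, but as a taste of what a basic SLURM submission looks like, here is a hedged sketch that writes a batch script and submits it with sbatch from Python. The partition name, resource counts, and train.py entry point are placeholders, not values from the post.

```python
import subprocess
from pathlib import Path

# A minimal batch script: the #SBATCH directives request resources,
# and the body runs on the allocated node. Partition, GPU count, and
# train.py are placeholders.
batch_script = """\
#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --time=04:00:00
#SBATCH --output=slurm-%j.out

srun python train.py
"""

Path("train.sbatch").write_text(batch_script)

# sbatch queues the job and prints the assigned job ID.
result = subprocess.run(["sbatch", "train.sbatch"],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
```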

November 24, 2024 · Alex Kim

Fine-Tuning Large Language Models with a Production-Grade Pipeline

Introduction - Solving cloud resources and reproducibility for LLMs

A few weeks ago, I wrote a post about the challenges of training large ML models, in particular the need for more computing power, the complexity of managing cloud resources, and the difficulty of keeping track of ML experiments and reproducing results. There I proposed a solution to these problems: using SkyPilot and DVC to manage cloud resources and track experiments, respectively. ...

September 8, 2023 · Alex Kim

Week 1: Kick-starting an ML project

Slides 🖼️

Week 1: ML project lifecycle and MLOps best practices

Learning objectives
- Understand the core philosophy behind MLOps ideas
- Apply best practices for establishing ML project structure and dependency management
- Manage project dependencies with pip and virtualenv
- Version datasets with DVC

Project Introduction

Problem Description and Dataset

This dataset contains 10,000 records, each of which corresponds to a different bank’s user. The target is Exited, a binary variable that describes whether the user decided to leave the bank. There are row and customer identifiers, four columns describing personal information about the user (surname, location, gender, and age), and some other columns containing information related to the loan (such as the credit score, the current balance in the user’s account, and whether they are an active member, among others). ...
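As a quick first look at the data, a snippet like the one below loads the raw CSV and checks the class balance of the Exited target. The data/raw/Churn_Modelling.csv path and the identifier column names are assumptions based on the standard Kaggle churn dataset, not taken from the course materials.

```python
import pandas as pd

# Assumed location and file name for the raw churn dataset.
df = pd.read_csv("data/raw/Churn_Modelling.csv")

print(df.shape)                                    # expect (10000, n_columns)
print(df["Exited"].value_counts(normalize=True))   # class balance of the binary target

# Drop identifier columns (assumed Kaggle names) that carry no predictive signal.
df = df.drop(columns=["RowNumber", "CustomerId", "Surname"])
```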

Alex Kim

Week 2: ML Pipelines, Reproducibility and Experimentation

Slides 🖼️

Week 2: ML Pipelines, Reproducibility and Experimentation

Learning objectives
- Refactor a Jupyter notebook into a reproducible ML pipeline
- Version artifacts of an ML pipeline in remote storage
- Iterate over a large number of ML experiments in a disciplined way

Steps

Refactor the Jupyter notebook into a DVC pipeline
Docs: https://dvc.org/doc/start/data-pipelines

Create a params.yaml file that the pipeline stages read their parameter values from:

```yaml
base:
  project: bank_customer_churn
  raw_data_dir: data/raw
  countries:
    - France
    - Spain
  feat_cols:
    - CreditScore
    - Age
    - Tenure
    - Balance
    - NumOfProducts
    - HasCrCard
    - IsActiveMember
    - EstimatedSalary
  targ_col: Exited
  random_state: 42
data_split:
  test_size: 0.25
  processed_data_dir: data/processed
train:
  model_type: randomforest
  model_dir: models
  model_path: models/clf-model.joblib
  params:
    n_estimators: 200
    max_depth: 20
```
...
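Later weeks import a load_params helper (from utils.load_params) that exposes these values with attribute access, e.g. params.train.model_path. That helper is not shown in the listing, so the following is only a minimal sketch of one possible implementation, assuming PyYAML and a small recursive namespace wrapper; the course's actual utility may differ.

```python
# src/utils/load_params.py -- hypothetical sketch, not the course's actual helper
from types import SimpleNamespace

import yaml


def _to_namespace(obj):
    """Recursively convert dicts to SimpleNamespace so nested values can be
    read with attribute access, e.g. params.train.params.max_depth."""
    if isinstance(obj, dict):
        return SimpleNamespace(**{k: _to_namespace(v) for k, v in obj.items()})
    if isinstance(obj, list):
        return [_to_namespace(v) for v in obj]
    return obj


def load_params(params_path):
    with open(params_path) as f:
        return _to_namespace(yaml.safe_load(f))
```

With the params.yaml above, load_params(params_path='params.yaml').train.params.n_estimators would evaluate to 200.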

Alex Kim

Week 3: CI/CD for ML and ML-based Web API

Slides 🖼️

Week 3: CI/CD for ML

Learning Objectives
- Learn the basics of CI/CD
- Leverage the power of CI/CD tools for ML projects with CML
- Integrate an ML model into the FastAPI framework
- Build and test a Docker container running a web API service
- Deploy the resulting Docker container to the cloud

Steps

Introduction to GitHub Actions and CML
- Introduction to GitHub Actions
- Introduction to CML

CI/CD: Automatic reporting for model-related changes

Add PERSONAL_ACCESS_TOKEN, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to GH secrets: https://docs.github.com/en/actions/security-guides/encrypted-secrets
- For AWS credentials, see https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/getting-your-credentials.html
- For PERSONAL_ACCESS_TOKEN:
  - Generate a new personal access token under GitHub developer settings:
    - in the "Note" field, type PERSONAL_ACCESS_TOKEN
    - select the repo scope
    - click "Generate token" and copy it
  - In your GitHub repository and/or organization, navigate to Settings -> Secrets -> New repository/organization secret:
    - in the "Name" field, type PERSONAL_ACCESS_TOKEN
    - in the "Value" field, paste the token
    - click Add secret

Create .github/workflows/train-model.yaml:

```yaml
name: train-model
on:
  push:
    paths:
      - "data/**"
      - "src/**"
      - "params.yaml"
      - "dvc.*"
jobs:
  train-model:
    runs-on: ubuntu-latest
    environment: cloud
    permissions:
      contents: read
      id-token: write
    steps:
      - uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      - uses: iterative/setup-cml@v1
      - uses: actions/setup-python@v2
        with:
          python-version: "3.10"
      - uses: actions/setup-node@v1
        with:
          node-version: '16'
      - name: SetupGitUser
        run: cml ci
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
      - name: TrainModel
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          pip install -r requirements.txt
          dvc pull
          dvc exp run
          dvc push
          # Create CML report
          echo "## Metrics" >> report.md
          dvc metrics show --md >> report.md
          echo "## Feature Importances" >> report.md
          csv2md reports/feat_imp.csv >> report.md
          echo "## Confusion Matrix" >> report.md
          echo '![](reports/figures/cm.png)' >> report.md
          cml comment create report.md
```

Push the workflow file with git.

Modify some model parameters (e.g. max_depth), rerun the pipeline (dvc exp run), and push changes to the DVC remote and git.

Review the GitHub Actions runs.

Web App Development

Create the web application src/app/main.py:

```python
import json
import sys
from pathlib import Path

import uvicorn

# Make the src/ directory importable so utils.load_params resolves.
src_path = Path(__file__).parent.parent.resolve()
sys.path.append(str(src_path))

from typing import List

import pandas as pd
from fastapi import Body, FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware
from joblib import load
from pydantic import BaseModel

from utils.load_params import load_params

app = FastAPI()

# https://fastapi.tiangolo.com/tutorial/cors/#use-corsmiddleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Load the trained model and feature columns configured in params.yaml.
params = load_params(params_path='params.yaml')
model_path = params.train.model_path
feat_cols = params.base.feat_cols
model = load(filename=model_path)


class Customer(BaseModel):
    CreditScore: int
    Age: int
    Tenure: int
    Balance: float
    NumOfProducts: int
    HasCrCard: int
    IsActiveMember: int
    EstimatedSalary: float


class Request(BaseModel):
    data: List[Customer]


@app.post("/predict")
async def predict(info: Request = Body(..., example={
    "data": [
        {
            "CreditScore": 619,
            "Age": 42,
            "Tenure": 2,
            "Balance": 0,
            "NumOfProducts": 1,
            "HasCrCard": 1,
            "IsActiveMember": 1,
            "EstimatedSalary": 101348.88
        },
        {
            "CreditScore": 699,
            "Age": 39,
            "Tenure": 21,
            "Balance": 0,
            "NumOfProducts": 2,
            "HasCrCard": 0,
            "IsActiveMember": 0,
            "EstimatedSalary": 93826.63
        }
    ]
})):
    # Convert the request body into a DataFrame and score it with the model.
    json_list = json.loads(info.json())
    data = json_list['data']
    input_data = pd.DataFrame(data)
    probs = model.predict_proba(input_data)[:, 0]
    probs = probs.tolist()
    return probs


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Test API ...
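The "Test API" step is truncated in the listing above; one minimal way to exercise the /predict endpoint from Python, assuming the app is running locally on port 8000 as in the __main__ block, is:

```python
import requests

# Same shape as the example payload in src/app/main.py.
payload = {
    "data": [
        {
            "CreditScore": 619, "Age": 42, "Tenure": 2, "Balance": 0,
            "NumOfProducts": 1, "HasCrCard": 1, "IsActiveMember": 1,
            "EstimatedSalary": 101348.88,
        }
    ]
}

resp = requests.post("http://localhost:8000/predict", json=payload)
resp.raise_for_status()
print(resp.json())  # a list with one predicted probability per customer
```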

Alex Kim

Week 4: Monitoring for ML Projects

Slides 🖼️

Week 4: Data Drift Monitoring for ML Projects

Learning Objectives
- Distinguish between application monitoring and ML monitoring
- Use the Alibi Detect framework to detect data drift

Steps

Introduction to Data Drift Monitoring
- What's data drift and why do we need to monitor for it?
- Intro to Alibi Detect

Add Churn_Modelling_Germany.csv to data/more_data/

Add a /more_data entry to data/.gitignore

Create and explore notebooks/DriftDetection.ipynb

Incorporate drift detection into the DVC pipeline

Create src/stages/drift_detector.py:

```python
import sys
from pathlib import Path

# Make the src/ directory importable so utils.load_params resolves.
src_path = Path(__file__).parent.parent.resolve()
sys.path.append(str(src_path))

import argparse

import pandas as pd
from alibi_detect.cd import TabularDrift
from alibi_detect.saving import save_detector
from joblib import load

from utils.load_params import load_params


def train_drift_detector(params):
    processed_data_dir = Path(params.data_split.processed_data_dir)
    model_dir = Path(params.train.model_dir)
    model_path = Path(params.train.model_path)
    model = load(model_path)

    # Use the combined train/test data as the reference distribution.
    X_test = pd.read_pickle(processed_data_dir / 'X_test.pkl')
    X_train = pd.read_pickle(processed_data_dir / 'X_train.pkl')
    X = pd.concat([X_test, X_train])
    feat_names = X.columns.tolist()

    # Reuse the model's preprocessing steps (everything but the final estimator).
    preprocessor = model[:-1]

    # Columns prefixed with 'cat__' are treated as categorical features.
    categories_per_feature = {i: None for i, k in enumerate(feat_names)
                              if k.startswith('cat__')}
    cd = TabularDrift(X,
                      p_val=.05,
                      preprocess_fn=preprocessor.transform,
                      categories_per_feature=categories_per_feature)

    detector_path = model_dir / 'drift_detector'
    save_detector(cd, detector_path)


if __name__ == '__main__':
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument('--config', dest='config', required=True)
    args = args_parser.parse_args()
    params = load_params(params_path=args.config)
    train_drift_detector(params)
```

Add a train_drift_detector stage to dvc.yaml ...
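To sketch how the saved detector might be used on a new batch of data (for example, the German customers added earlier in this week), the snippet below loads it with Alibi Detect's load_detector and checks for drift. The X_new.pkl path is hypothetical, and the exact feature preparation in the course's notebook and pipeline may differ.

```python
from pathlib import Path

import pandas as pd
from alibi_detect.saving import load_detector

# Load the detector saved by train_drift_detector (model_dir from params.yaml).
cd = load_detector(Path("models") / "drift_detector")

# Hypothetical new batch with the same feature columns as X_train/X_test.
X_new = pd.read_pickle("data/processed/X_new.pkl")

preds = cd.predict(X_new)
print("Drift detected?", bool(preds["data"]["is_drift"]))
print("Per-feature p-values:", preds["data"]["p_val"])
```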

Alex Kim