Orchestrating LLM Fine-tuning on Kubernetes with SkyPilot and MLflow: A Complete Guide

Training and fine-tuning Large Language Models (LLMs) requires significant computational resources and careful experiment tracking. While many focus on the modeling aspects, efficiently managing compute resources and tracking experiments is equally important for successful ML projects. This guide demonstrates how to leverage SkyPilot and MLflow, two powerful open-source tools, to orchestrate LLM fine-tuning jobs effectively.

An open-source stack for LLM fine-tuning

Modern LLM fine-tuning workflows involve multiple moving parts:

- Resource orchestration across different cloud providers
- Environment setup and dependency management
- Experiment tracking and monitoring
- Distributed training coordination
- System metrics collection

Using SkyPilot for resource orchestration and MLflow for experiment tracking provides an easy-to-use and fully open-source stack for managing these complexities. ...
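As a taste of how the two tools fit together, here is a minimal sketch, assuming SkyPilot's Python API and the MLflow client are installed and an MLflow tracking server is reachable; the tracking URI, cluster name, and finetune.py script are illustrative placeholders, not the post's exact code.

```python
import sky
import mlflow

# Hypothetical MLflow tracking server; replace with your own URI.
mlflow.set_tracking_uri("http://mlflow.example.com:5000")

# Describe the fine-tuning job: environment setup, run command, and GPU resources.
task = sky.Task(
    setup="pip install -r requirements.txt",
    run="python finetune.py --epochs 3",  # illustrative training script and args
)
task.set_resources(sky.Resources(accelerators="A100:1"))

# Launch on Kubernetes (or any enabled cloud) and record the run in MLflow.
with mlflow.start_run(run_name="skypilot-finetune"):
    mlflow.log_param("accelerator", "A100:1")
    sky.launch(task, cluster_name="llm-finetune")
```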

January 11, 2025 · Alex Kim

Kubernetes Mental Model

I am preparing for my CKAD (Certified Kubernetes Application Developer) exam. Below is the mental model of K8s concepts that helps me understand Kubernetes. Hope it helps you too.

The Big Picture: Kubernetes as an Orchestrator

What is Kubernetes? Kubernetes is an automation system for deploying and managing containerized applications at scale. Rather than manually handling each container, you define your desired state, like "I want three replicas of my service running." Kubernetes ensures this state remains true even if servers fail or traffic surges. ...
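To make the "three replicas" example concrete, here is a minimal sketch using the official Kubernetes Python client, assuming a working kubeconfig; the my-service name, labels, and image are illustrative.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g., ~/.kube/config).
config.load_kube_config()

labels = {"app": "my-service"}  # illustrative app label
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="my-service"),
    spec=client.V1DeploymentSpec(
        replicas=3,  # the desired state: three replicas
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                containers=[client.V1Container(name="app", image="my-service:latest")]
            ),
        ),
    ),
)

# Submit the desired state; Kubernetes controllers keep it true if pods or nodes fail.
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```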

January 4, 2025 · Alex Kim

Intro to SLURM for ML Practitioners

SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager designed to schedule and manage jobs on large clusters. In the world of LLMs, SLURM has seen a resurgence in popularity due to the increased demand for training large models and scaling them to multiple nodes. This guide will introduce the fundamental concepts of SLURM, common commands and script structures, and show advanced scenarios like distributed multi-node training. I’ll also share some useful tips and tricks. ...
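As a preview of the script structure the guide covers, here is a minimal sketch of submitting a batch job from Python, assuming a cluster where sbatch is available and a gpu partition exists; the job name, partition, and train.py script are illustrative.

```python
import subprocess
import tempfile

# A minimal sbatch script: job name, resources, and the command to run.
sbatch_script = """#!/bin/bash
#SBATCH --job-name=llm-train      # illustrative job name
#SBATCH --partition=gpu           # illustrative partition
#SBATCH --nodes=1
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --time=02:00:00

srun python train.py              # illustrative training script
"""

# Write the script to a file and hand it to the scheduler.
with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
    f.write(sbatch_script)
    script_path = f.name

subprocess.run(["sbatch", script_path], check=True)
```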

November 24, 2024 · Alex Kim

Experiments with OpenAI's Function Calling

Intro

This notebook (also on GitHub) demonstrates how to use the Function Calling functionality of the OpenAI API. In this demo, we'll use the Northwind database to convert natural language queries into SQL:

"What is the total revenue for each product in the database?" -> "SELECT ... FROM ..." -> DataFrame

There will be two function calling examples:

- A simple one-step function call that converts a natural language query into SQL: we'll put the database schema into the system prompt and then use function calling to produce the SQL.
- A two-step function call that first gets the schema of the database and then converts a natural language query into SQL.

At the end, we'll compare the two approaches and do a quick-and-dirty evaluation of the results using a hand-curated list of questions and their expected SQL queries in eval_questions.csv. ...
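Here is a minimal sketch of the one-step approach, assuming the openai Python package (v1+) and an OPENAI_API_KEY in the environment; the write_sql function name, the inlined schema snippet, and the model name are illustrative, not the notebook's exact code.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative schema snippet placed in the system prompt.
schema = "Orders(OrderID, ProductID, Quantity, UnitPrice), Products(ProductID, ProductName)"

# Declare a function the model can "call" by returning structured arguments.
tools = [{
    "type": "function",
    "function": {
        "name": "write_sql",  # hypothetical function name
        "description": "Return a SQL query answering the user's question.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": f"You write SQL for this schema: {schema}"},
        {"role": "user", "content": "What is the total revenue for each product?"},
    ],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "write_sql"}},
)

# The function arguments come back as a JSON string containing the generated SQL.
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(args["query"])
```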

May 5, 2024 · Alex Kim

Fine-Tuning Large Language Models with a Production-Grade Pipeline

Introduction - Solving cloud resources and reproducibility for LLMs

A few weeks ago, I wrote a post about the challenges of training large ML models, in particular: the need for more computing power and the complexity of managing cloud resources, and the difficulty of keeping track of ML experiments and reproducing results. There I proposed a solution to these problems by using SkyPilot and DVC to manage cloud resources and track experiments, respectively. ...

September 8, 2023 · Alex Kim

ML experiments in the cloud with SkyPilot and DVC

Introduction

One of the things that makes machine learning hard is that you have to run a lot of experiments. You have to try different models, different data sets, different hyperparameters, different features. And each experiment can take a long time to run, especially if you're working on deep learning problems. You can't just run them on your laptop or desktop. You need more computing power, and you need it fast. ...

August 10, 2023 · Alex Kim