Intro to SLURM for ML Practitioners

SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager designed to schedule and manage jobs on large clusters. In the world of LLMs, SLURM has seen a resurgence in popularity due to the increased demand for training large models and scaling them to multiple nodes. This guide will introduce the fundamental concepts of SLURM, common commands and script structures, and show advanced scenarios like distributed multi-node training. I’ll also share some useful tips and tricks. ...

November 24, 2024 · Alex Kim

Experiments with OpenAI's Function Calling

Intro This notebook (also on GitHub) demonstrates how to use function calling with the OpenAI API. In this demo, we’ll use the Northwind database to convert natural language queries into SQL: "What is the total revenue for each product in the database?" -> -> "SELECT ... FROM ..." -> DataFrame. There will be two function calling examples: (1) a simple one-step function call, where we’ll put the database schema into the system prompt and then use function calling to convert a natural language query into SQL; (2) a two-step function call that first gets the schema of the database and then converts a natural language query into SQL. At the end, we’ll compare the two approaches and do a quick-and-dirty evaluation of the results using a hand-curated list of questions and their expected SQL queries in eval_questions.csv. ...

May 5, 2024 · Alex Kim

Fine-Tuning Large Language Models with a Production-Grade Pipeline

Introduction - Solving cloud resources and reproducibility for LLMs A few weeks ago, I wrote a post about the challenges of training large ML models, in particular: the need for more computing power and the complexity of managing cloud resources; the difficulty of keeping track of ML experiments and reproducing results. There I proposed a solution to these problems by using SkyPilot and DVC to manage cloud resources and track experiments, respectively. ...

September 8, 2023 · Alex Kim

ML experiments in the cloud with SkyPilot and DVC

Introduction One of the things that makes machine learning hard is that you have to run a lot of experiments. You have to try different models, different data sets, different hyperparameters, different features. And each experiment can take a long time to run, especially if you’re working on deep learning problems. You can’t just run them on your laptop or desktop. You need more computing power, and you need it fast. ...

August 10, 2023 · Alex Kim