US Tariffs, DeepSeek and OpenAI

This week the new Trump administration announced tariffs on key trading partners Canada and Mexico. Both countries vowed to retaliate, and tensions are high: “Canada imposes 25% tariffs in trade war with US”. Living in Canada, I am obviously curious about the impact this will have on both sides of the border. I am a self-described economics nerd: having never studied economics, I try to read and listen about it when I have spare time. One other topic that’s been on my mind is the recent release of the DeepSeek R1 model and the release of the OpenAI O3 model that followed. Both are SOTA (as of Feb 2025) LLMs that were trained with Reinforcement Learning to “reason” through a chain of thought before providing an answer: ...

February 2, 2025

Orchestrating LLM Fine-tuning on Kubernetes with SkyPilot and MLflow: A Complete Guide

Training and fine-tuning Large Language Models (LLMs) requires significant computational resources and careful experiment tracking. While many focus on the modeling aspects, efficiently managing compute resources and experiment tracking is equally important for successful ML projects. This guide demonstrates how to leverage SkyPilot and MLflow, two powerful open-source tools, to orchestrate LLM fine-tuning jobs effectively.
An open-source stack for LLM fine-tuning
Modern LLM fine-tuning workflows involve multiple moving parts: resource orchestration across different cloud providers, environment setup and dependency management, experiment tracking and monitoring, distributed training coordination, and system metrics collection. Using SkyPilot for resource orchestration and MLflow for experiment tracking provides an easy-to-use and fully open-source stack for managing these complexities. ...

January 11, 2025

Kubernetes Mental Model

I am preparing for my CKAD (Certified Kubernetes Application Developer) exam. Below is the mental model of K8S concepts that helps me understand Kubernetes. Hope it helps you too.
The Big Picture: Kubernetes as an Orchestrator
What is Kubernetes? Kubernetes is an automation system for deploying and managing containerized applications at scale. Rather than manually handling each container, you define your desired state, such as “I want three replicas of my service running.” Kubernetes ensures this state remains true even if servers fail or traffic surges. ...

January 4, 2025

Intro to SLURM for ML Practitioners

SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager designed to schedule and manage jobs on large clusters. In the world of LLMs, SLURM has seen a resurgence in popularity due to the increased demand for training large models and scaling them to multiple nodes. This guide will introduce the fundamental concepts of SLURM, common commands and script structures, and show advanced scenarios like distributed multi-node training. I’ll also share some useful tips and tricks. ...

November 24, 2024

Experiments with OpenAI's Function Calling

Intro
This notebook (also on GitHub) demonstrates how to use Function Calling with the OpenAI API. In this demo, we’ll use the Northwind database to convert natural language queries into SQL: "What is the total revenue for each product in the database?" -> -> "SELECT ... FROM ..." -> DataFrame
There will be two function calling examples: (1) a simple one-step function call, where we put the database schema into the system prompt and then use function calling to convert a natural language query into SQL; (2) a two-step function call that first gets the schema of the database and then converts a natural language query into SQL. At the end, we’ll compare the two approaches and do a quick-and-dirty evaluation of the results using a hand-curated list of questions and their expected SQL queries in eval_questions.csv. ...

May 5, 2024

Fine-Tuning Large Language Models with a Production-Grade Pipeline

Introduction - Solving cloud resources and reproducibility for LLMs
A few weeks ago, I wrote a post about the challenges of training large ML models, in particular: the need for more computing power and the complexity of managing cloud resources; the difficulty of keeping track of ML experiments and reproducing results. There I proposed a solution to these problems: using SkyPilot to manage cloud resources and DVC to track experiments. ...

September 8, 2023

Trying to Understand Something Difficult? Minimize the Number of Attempts!

When you’re trying to wrap your head around challenging new ideas, your natural instinct may be to chip away at them in several attempts. Read a bit, take a break, come back later, like attacking a mountain ascent in short sprints. This may feel like progress, but you’re actually doing yourself a disservice. Every time you re-engage with the difficult material, you’re forcing your mind to re-enter a state of intense flow. This refocusing requires mental energy, and that energy draws down your limited cognitive resources. Each time you dip back into flow, you have to: ...

August 16, 2023

ML experiments in the cloud with SkyPilot and DVC

Introduction
One of the things that makes machine learning hard is that you have to run a lot of experiments. You have to try different models, different data sets, different hyperparameters, different features. And each experiment can take a long time to run, especially if you’re working on deep learning problems. You can’t just run them on your laptop or desktop. You need more computing power, and you need it fast. ...

August 10, 2023

Why Sales Engineers Exist

Paul Graham once wrote that “a startup is a company designed to grow fast.” For a startup to grow fast, especially a B2B startup selling SaaS products to other companies, it needs sales engineers. Sales engineers are the technical experts who work closely with sales teams. They exist because selling enterprise software requires deep technical knowledge that salespeople typically don’t have. The salesperson establishes rapport with the prospect and understands their business needs. But when it comes time to demo the product and answer nitty-gritty technical questions, that’s where the sales engineer comes in. ...

August 2, 2023

Don’t know what to do next? Teach!

We (especially those in the tech industry) have an instinctual aversion to pedagogy. “Those who can’t do, teach,” the old saying goes. But the truth is the opposite. Teaching does not indicate an inability to do something. On the contrary, teaching empowers and enables ability. It is the highest form of understanding. When you teach something, you gain a deeper mastery over the subject matter than you would as a passive student. As Leonardo da Vinci said: ...

July 31, 2023