Training and fine-tuning Large Language Models (LLMs) requires significant computational resources and careful experiment tracking. While many focus on the modeling aspects, efficiently managing compute resources and experiment tracking is equally important for successful ML projects. This guide demonstrates how to leverage SkyPilot and MLflow - two powerful open-source tools - to orchestrate LLM fine-tuning jobs effectively.
An open-source stack for LLM fine-tuning
Modern LLM fine-tuning workflows involve multiple moving parts:
- Resource orchestration across different cloud providers
- Environment setup and dependency management
- Experiment tracking and monitoring
- Distributed training coordination
- System metrics collection
Using SkyPilot for resource orchestration and MLflow for experiment tracking provides an easy-to-use and fully open-source stack for managing these complexities.
We’ll use the Llama 3.1 8B fine-tuning example from Philipp Schmid’s “How to fine-tune open LLMs in 2025” blog post to demonstrate these tools in action.
Setting Up the Stack
SkyPilot Configuration
💡 Below I’ll be using a Kubernetes cluster; however, SkyPilot supports virtually every cloud provider you can think of.
First, install SkyPilot with Kubernetes support using pip:
$ pip install "skypilot[kubernetes]"
Configure your Kubernetes cluster access by ensuring your kubeconfig is properly set up. Then, verify the installation:
$ sky check kubernetes
Checking credentials to enable clouds for SkyPilot.
Kubernetes: enabled
To enable a cloud, follow the hints above and rerun: sky check
If any problems remain, refer to detailed docs at: https://docs.skypilot.co/en/latest/getting-started/installation.html
🎉 Enabled clouds 🎉
✔ Kubernetes
SkyPilot uses a YAML configuration to define jobs. The SkyPilot task definition sky.yaml below is all that’s needed to kick off our training job on our infrastructure.
# sky.yaml
# To launch the cluster:
#   sky launch -c dev sky.yaml --env-file .env
# To rerun training (i.e. only the "run" section):
#   sky exec dev sky.yaml --env-file .env

resources:
  cloud: kubernetes  # or aws, gcp, azure, and many others
  accelerators: H100:8

workdir: .  # syncs current directory to ~/sky_workdir/ on the cluster

envs:
  CONFIG_FILE: recipes/llama-3-1-8b-qlora.yaml

# setup step is executed once upon cluster provisioning with `sky launch`
setup: |
  sudo apt install nvtop -y
  pip install -U -r requirements.txt
  FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn --no-build-isolation
  python generate_train_dataset.py

# run step is executed for both `sky exec` and `sky launch` commands
run: |
  accelerate launch \
    --num_processes 8 \
    train.py --config $CONFIG_FILE
You can find the rest of the training code and configs in this repository. The details of what’s inside train.py are beyond the scope of this post. The repository also contains sky_multi_node.yaml, a multi-node version of the sky.yaml file.
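To give a rough idea of what the multi-node variant involves, here is a minimal sketch of a two-node SkyPilot task. The exact contents of sky_multi_node.yaml in the repository may differ; the node count, port, and accelerate flags below are illustrative assumptions, while the SKYPILOT_* variables are environment variables SkyPilot injects on each node at runtime.

# Multi-node sketch (illustrative; see sky_multi_node.yaml in the repo for the real file)
resources:
  cloud: kubernetes
  accelerators: H100:8

num_nodes: 2  # SkyPilot provisions one worker per node

workdir: .

envs:
  CONFIG_FILE: recipes/llama-3-1-8b-qlora.yaml

# setup: same as in sky.yaml (omitted here)

run: |
  # SkyPilot exposes the node IP list, this node's rank, and the GPU count per node
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  accelerate launch \
    --num_machines $SKYPILOT_NUM_NODES \
    --machine_rank $SKYPILOT_NODE_RANK \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 29500 \
    --num_processes $(($SKYPILOT_NUM_NODES * $SKYPILOT_NUM_GPUS_PER_NODE)) \
    train.py --config $CONFIG_FILE

Launching it works the same way as the single-node task, e.g. sky launch -c dev-multi sky_multi_node.yaml --env-file .env.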
MLflow Configuration
In this example project, MLflow configuration is managed through environment variables. Create a .env file:
# .env
MLFLOW_TRACKING_URI=https://your-mlflow-server
MLFLOW_TRACKING_SERVER_CERT_PATH=/path/to/cert.pem
MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true
MLFLOW_EXPERIMENT_NAME=LLM_Fine_Tuning
MLFLOW_TRACKING_USERNAME=your-username
MLFLOW_TRACKING_PASSWORD=your-password
HF_TOKEN=your-huggingface-token
# TEST_MODE=true # Uncomment for development
The MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true setting enables collection of system metrics (GPU utilization, memory usage, etc.) but requires additional dependencies:
# requirements.txt
psutil==6.1.1
pynvml==12.0.0
For this tutorial, I used a managed MLflow service from Nebius AI. Setting up the MLflow server instance was straightforward, and they provided all the necessary configuration values to connect with the server.
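With the variables in place, it’s worth a quick sanity check that MLflow can actually reach the tracking server before a long job starts. The snippet below is a minimal sketch: it assumes python-dotenv is installed to load the .env file locally (MLflow itself reads the MLFLOW_* variables straight from the environment), and it simply logs a throwaway run.

# check_mlflow.py (illustrative sanity check)
import os

import mlflow
from dotenv import load_dotenv  # assumes python-dotenv is installed

# Export the variables from .env into this process; MLflow reads the
# MLFLOW_* variables directly from the environment.
load_dotenv()

mlflow.set_experiment(os.environ["MLFLOW_EXPERIMENT_NAME"])

with mlflow.start_run(run_name="connectivity-check") as run:
    mlflow.log_param("smoke_test", True)
    print(f"Logged run {run.info.run_id} to {mlflow.get_tracking_uri()}")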
Kicking Off Training
Once you have your configuration files ready, launching training with SkyPilot is straightforward:
Initial Launch: To provision the cluster and start training for the first time, use:
sky launch -c dev sky.yaml --env-file .env
Subsequent Runs: For additional training runs on the same cluster, use:
sky exec dev sky.yaml --env-file .env
While training is running, SkyPilot streams the job logs into the console:
SkyPilot job logs
You can monitor the cluster status using:
sky status
To stop the cluster when training is complete:
sky down dev
For debugging purposes, you can connect to the running cluster via SSH:
ssh dev
MLflow Integration with Distributed Training
In distributed training environments, MLflow logging must be carefully managed to prevent logging conflicts between processes. Multiple processes attempting to log metrics simultaneously can lead to race conditions, duplicate entries, or corrupted logs. Additionally, system metrics need to be properly attributed to individual nodes to maintain accurate monitoring data. Here’s the key integration code:
# train.py (excerpt)
import logging

import mlflow
from transformers.integrations import MLflowCallback
from trl import SFTTrainer

logger = logging.getLogger(__name__)


def train_function(model_args, script_args, training_args):
    # ... model, tokenizer, train_dataset, and peft_config are created here (omitted) ...

    # Initialize MLflow callback
    mlflow_callback = MLflowCallback()

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
        peft_config=peft_config,
        callbacks=[mlflow_callback],
    )

    # Only initialize MLflow on the main process
    run_id = None
    if trainer.accelerator.is_main_process:
        mlflow_callback.setup(training_args, trainer, model)

        # Set node ID for system metrics
        node_id = trainer.accelerator.process_index
        mlflow_callback._ml_flow.system_metrics.set_system_metrics_node_id(node_id)

        # Get run ID for post-training logging
        run_id = mlflow_callback._ml_flow.active_run().info.run_id
        logger.info(f'Run ID: {run_id}')

    # Training loop
    train_result = trainer.train()

    # Post-training metrics logging only on main process
    if trainer.accelerator.is_main_process:
        if run_id is not None:
            metrics = train_result.metrics
            train_samples = len(train_dataset)
            with mlflow.start_run(run_id=run_id):
                mlflow.log_param('train_samples', train_samples)
                for key, value in metrics.items():
                    mlflow.log_metric(key=key, value=value)
Key Considerations for Distributed Training
- Process Management: Only the main process should initialize MLflow runs and log metrics to avoid conflicts.
- Run ID Tracking: The MLflow run ID is stored because trainer.train() automatically ends its MLflow run when complete. Without capturing the ID beforehand, we wouldn’t be able to log additional metrics after training finishes.
- System Metrics: Each node in distributed training needs a unique identifier for system metrics collection.
Monitoring Training Progress
MLflow provides a web UI for monitoring experiments. Key metrics tracked include:
- Training Metrics:
- Loss
- Learning rate
- Batch size
- Training speed (samples/second)
- System Metrics:
- GPU utilization
- GPU memory usage
- CPU utilization
- System memory usage
Here’s how to query metrics programmatically assuming you’ve set all required MLflow environment variables:
# query_mlflow.py
import mlflow


def get_training_metrics(run_id):
    client = mlflow.MlflowClient()
    run = client.get_run(run_id)
    metrics = run.data.metrics
    params = run.data.params
    return metrics, params


if __name__ == "__main__":
    run_id = "<your-run-id>"
    metrics, params = get_training_metrics(run_id)
    print(f"Metrics: {metrics}")
    print(f"Params: {params}")
A Few Best Practices and Tips
- Use a .env file for local development
- If using HF Accelerate, always check trainer.accelerator.is_main_process before MLflow operations
- Monitor system metrics to optimize resource usage
- When using cloud providers, use managed jobs that can automatically recover from any underlying spot preemptions or hardware failures
- Take advantage of sky queue for scheduling multiple training runs with different hyperparameters
- Utilize sky logs to access historical job outputs and debugging information (example commands for these SkyPilot tips follow below)
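For reference, the commands behind the last three tips look roughly like this. The job ID and recipe file name are placeholders, and the managed-jobs syntax assumes a recent SkyPilot release:

# Launch training as a managed job that auto-recovers from spot preemptions
sky jobs launch sky.yaml --env-file .env

# Queue another run with a different config on the existing cluster,
# then inspect pending and running jobs
sky exec dev sky.yaml --env-file .env --env CONFIG_FILE=recipes/another-recipe.yaml
sky queue dev

# Fetch the logs of a past job (job IDs come from `sky queue`)
sky logs dev 2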
Conclusion
The combination of SkyPilot and MLflow creates a powerful, open-source stack for orchestrating LLM fine-tuning jobs. Key benefits include:
- Flexible resource management across cloud providers
- Comprehensive experiment tracking
- Detailed system metrics monitoring
- Support for distributed training
- Integration with popular ML frameworks
This setup scales seamlessly from single-GPU experiments to large distributed training jobs and can be extended to handle complex workflows.