# Trackio Integration for TRL Training **Trackio** is an experiment tracking library that provides real-time metrics visualization for remote training on Hugging Face Jobs infrastructure. ⚠️ **IMPORTANT**: For Jobs training (remote cloud GPUs): - Training happens on ephemeral cloud runners (not your local machine) - Trackio syncs metrics to a Hugging Face Space for real-time monitoring - Without a Space, metrics are lost when the job completes - The Space dashboard persists your training metrics permanently ## Setting Up Trackio for Jobs **Step 1: Add trackio dependency** ```python # /// script # dependencies = [ # "trl>=0.12.0", # "trackio", # Required! # ] # /// ``` **Step 2: Create a Trackio Space (one-time setup)** **Option A: Let Trackio auto-create (Recommended)** Pass a `space_id` to `trackio.init()` and Trackio will automatically create the Space if it doesn't exist. **Option B: Create manually** - Create Space via Hub UI at https://huggingface.co/new-space - Select Gradio SDK - OR use command: `huggingface-cli repo create my-trackio-dashboard --type space --space_sdk gradio` **Step 3: Initialize Trackio with space_id** ```python import trackio trackio.init( project="my-training", space_id="username/trackio", # CRITICAL for Jobs! Replace 'username' with your HF username config={ "model": "Qwen/Qwen2.5-0.5B", "dataset": "trl-lib/Capybara", "learning_rate": 2e-5, } ) ``` **Step 4: Configure TRL to use Trackio** ```python SFTConfig( report_to="trackio", # ... other config ) ``` **Step 5: Finish tracking** ```python trainer.train() trackio.finish() # Ensures final metrics are synced ``` ## What Trackio Tracks Trackio automatically logs: - ✅ Training loss - ✅ Learning rate - ✅ GPU utilization - ✅ Memory usage - ✅ Training throughput - ✅ Custom metrics ## How It Works with Jobs 1. **Training runs** → Metrics logged to local SQLite DB 2. **Every 5 minutes** → Trackio syncs DB to HF Dataset (Parquet) 3. **Space dashboard** → Reads from Dataset, displays metrics in real-time 4. **Job completes** → Final sync ensures all metrics persisted ## Default Configuration Pattern **Use sensible defaults for trackio configuration unless user requests otherwise.** ### Recommended Defaults ```python import trackio trackio.init( project="qwen-capybara-sft", name="baseline-run", # Descriptive name user will recognize space_id="username/trackio", # Default space: {username}/trackio config={ # Keep config minimal - hyperparameters and model/dataset info only "model": "Qwen/Qwen2.5-0.5B", "dataset": "trl-lib/Capybara", "learning_rate": 2e-5, "num_epochs": 3, } ) ``` **Key principles:** - **Space ID**: Use `{username}/trackio` with "trackio" as default space name - **Run naming**: Unless otherwise specified, name the run in a way the user will recognize - **Config**: Keep minimal - don't automatically capture job metadata unless requested - **Grouping**: Optional - only use if user requests organizing related experiments ## Grouping Runs (Optional) The `group` parameter helps organize related runs together in the dashboard sidebar. This is useful when user is running multiple experiments with different configurations but wants to compare them together: ```python # Example: Group runs by experiment type trackio.init(project="my-project", run_name="baseline-run-1", group="baseline") trackio.init(project="my-project", run_name="augmented-run-1", group="augmented") trackio.init(project="my-project", run_name="tuned-run-1", group="tuned") ``` Runs with the same group name can be grouped together in the sidebar, making it easier to compare related experiments. You can group by any configuration parameter: ```python # Hyperparameter sweep - group by learning rate trackio.init(project="hyperparam-sweep", run_name="lr-0.001-run", group="lr_0.001") trackio.init(project="hyperparam-sweep", run_name="lr-0.01-run", group="lr_0.01") ``` ## Environment Variables for Jobs You can configure trackio using environment variables instead of passing parameters to `trackio.init()`. This is useful for managing configuration across multiple jobs. **`HF_TOKEN`** Required for creating Spaces and writing to datasets (passed via `secrets`): ```python hf_jobs("uv", { "script": "...", "secrets": { "HF_TOKEN": "$HF_TOKEN" # Enables Space creation and Hub push } }) ``` ### Example with Environment Variables ```python hf_jobs("uv", { "script": """ # Training script - trackio config from environment import trackio from datetime import datetime # Auto-generate run name timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M") run_name = f"sft_qwen25_{timestamp}" # Project and space_id can come from environment variables trackio.init(run_name=run_name, group="SFT") # ... training code ... trackio.finish() """, "flavor": "a10g-large", "timeout": "2h", "secrets": {"HF_TOKEN": "$HF_TOKEN"} }) ``` **When to use environment variables:** - Managing multiple jobs with same configuration - Keeping training scripts portable across projects - Separating configuration from code **When to use direct parameters:** - Single job with specific configuration - When clarity in code is preferred - When each job has different project/space ## Viewing the Dashboard After starting training: 1. Navigate to the Space: `https://huggingface.co/spaces/username/trackio` 2. The Gradio dashboard shows all tracked experiments 3. Filter by project, compare runs, view charts with smoothing ## Recommendation - **Trackio**: Best for real-time monitoring during long training runs - **Weights & Biases**: Best for team collaboration, requires account