Built with Axolotl

See axolotl config

axolotl version: 0.13.0.dev0

base_model: meta-llama/Llama-3.2-3B
tokenizer_type: AutoTokenizer
trust_remote_code: true
strict: false

is_llama_derived_model: true

chat_template: chatml

plugins:
  - axolotl.integrations.liger.LigerPlugin

special_tokens:
  pad_token: "<|eot_id|>"

datasets:
  - path: nvidia/Llama-Nemotron-Post-Training-Dataset
    name: SFT           
    split: chat         
    type: chat_template
    field_messages: input            
    message_property_mappings:
      role: role
      content: content
    field_output: output

train_on_inputs: false

sequence_len: 8192
eval_sequence_len: 8192
pad_to_sequence_len: true
sample_packing: true
sample_packing_group_size: 100000
sample_packing_bin_size: 200
group_by_length: true

flash_attn: true

micro_batch_size: 1               
gradient_accumulation_steps: 8    
num_epochs: 3

learning_rate: 2.0e-5
optimizer: adamw_torch_fused
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1.0e-8

lr_scheduler: cosine
warmup_steps: 100
weight_decay: 0.0   

bf16: true          
tf32: true
gradient_checkpointing: true
activation_offloading: false

val_set_size: 0.01          
eval_strategy: steps
eval_steps: 100

save_strategy: steps
save_steps: 100
save_total_limit: 3
save_only_model: false
save_safetensors: true
load_best_model_at_end: true
metric_for_best_model: eval_loss
greater_is_better: false

logging_steps: 10

output_dir: ./outputs/Llama-3.2-3B-base-nemotron-3epochs/
seed: 42

use_wandb: true
wandb_project: "llama31_base_nemotron"
wandb_name: "llama31-8b-base-nemotron"

outputs/Llama-3.2-3B-base-nemotron-3epochs/

This model is a fine-tuned version of meta-llama/Llama-3.2-3B on the nvidia/Llama-Nemotron-Post-Training-Dataset. It achieves the following results on the evaluation set:

  • Loss: 1.1378
  • Memory/max active (GiB): 30.79
  • Memory/max allocated (GiB): 30.79
  • Memory/device reserved (GiB): 45.32

Model description

More information needed

Intended uses & limitations

More information needed
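
Until the card is filled in, here is a minimal inference sketch. It assumes the checkpoint is published as cemig-temp/llama-3.2-3b-base-nemotron-3epochs and that the ChatML chat template configured during training was saved with the tokenizer; adjust the repo id or prompt formatting if either assumption does not hold.

```python
# Minimal inference sketch (assumed repo id; assumes the ChatML template
# from training is stored in the tokenizer config).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "cemig-temp/llama-3.2-3b-base-nemotron-3epochs"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain what sample packing does during SFT."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=256)

print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```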

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 8
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 100
  • training_steps: 855
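
For reference, the effective batch size of 8 above is micro_batch_size 1 × gradient_accumulation_steps 8, and the optimizer and learning-rate schedule correspond roughly to the plain PyTorch/Transformers setup sketched below (an approximation under the reported hyperparameters, not Axolotl's actual training loop):

```python
# Rough equivalent of the optimizer and LR schedule listed above
# (a sketch, not Axolotl internals).
import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer_and_scheduler(model):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=2e-5,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0.0,
        fused=True,  # adamw_torch_fused (CUDA only)
    )
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=100,
        num_training_steps=855,  # total optimizer steps reported above
    )
    return optimizer, scheduler
```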

Training results

| Training Loss | Epoch  | Step | Validation Loss | Active (GiB) | Allocated (GiB) | Reserved (GiB) |
|--------------:|-------:|-----:|----------------:|-------------:|----------------:|---------------:|
| No log        | 0      | 0    | 3.5478          | 17.34        | 17.34           | 17.62          |
| 1.615         | 0.3498 | 100  | 1.5971          | 30.79        | 30.79           | 44.95          |
| 1.3711        | 0.6996 | 200  | 1.3929          | 30.79        | 30.79           | 45.32          |
| 1.1403        | 1.0490 | 300  | 1.3040          | 30.79        | 30.79           | 45.32          |
| 1.077         | 1.3988 | 400  | 1.2131          | 30.79        | 30.79           | 45.32          |
| 1.0224        | 1.7486 | 500  | 1.1687          | 30.79        | 30.79           | 45.32          |
| 0.9557        | 2.0979 | 600  | 1.1472          | 30.79        | 30.79           | 45.32          |
| 0.9446        | 2.4477 | 700  | 1.1403          | 30.79        | 30.79           | 45.32          |
| 0.9357        | 2.7976 | 800  | 1.1378          | 30.79        | 30.79           | 45.32          |

Framework versions

  • Transformers 4.57.1
  • PyTorch 2.9.0+cu130
  • Datasets 4.3.0
  • Tokenizers 0.22.1