# LEMA-Llama-2-7b (Proof of Concept)
This model is a demonstration of the LEMA (Layer-wise Efficient Memory Abstraction) framework. It proves that large language models (7B+) can be fine-tuned on consumer-grade hardware with limited VRAM (e.g., 16GB Tesla P100) by virtualizing GPU memory.
Key Achievement: Fine-tuned Llama-2-7B using only 6.36 GB of VRAM (standard LoRA typically requires ~14GB+ for this configuration).
Training code is available in the GitHub repository: LEMA-llama
## Model Details
- Base Model: NousResearch/Llama-2-7b-hf
- Framework: LEMA v1.0
- Fine-Tuning Method: LoRA (Rank 16, Alpha 32)
- Memory Strategy: Streaming (Triple-Buffer: Disk -> RAM -> VRAM)
- Precision: FP16
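The streaming strategy can be sketched as a prefetch pipeline in which only one layer's weights occupy VRAM at a time. Below is a minimal pure-Python simulation of the idea, not the actual LEMA implementation (the function name and buffer sizes are illustrative):

```python
from collections import deque

def stream_layers(layer_ids, prefetch=1):
    """Toy model of LEMA-style triple buffering: Disk -> RAM -> VRAM.

    At most `prefetch` layers wait in the RAM buffer and a single layer
    occupies "VRAM" at a time; all remaining layers stay on "disk".
    """
    disk = iter(layer_ids)          # layers not yet loaded
    ram = deque()                   # RAM staging buffer
    computed = []                   # order in which layers reach "VRAM"

    for _ in range(prefetch):       # pre-fill the RAM buffer from disk
        try:
            ram.append(next(disk))
        except StopIteration:
            break

    while ram:
        layer = ram.popleft()       # RAM -> VRAM
        computed.append(layer)      # compute on this layer
        try:
            ram.append(next(disk))  # Disk -> RAM (prefetch the next layer)
        except StopIteration:
            pass
    return computed

# All 32 Llama-2-7B decoder layers are visited in order, one at a time:
order = stream_layers(range(32))
```

Because each layer is evicted as soon as its work is done, peak residency is bounded by the buffer sizes rather than the full model, which is what keeps VRAM flat at ~6.4 GB.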
## Training Configuration
The model was trained to learn a strict custom chat format (`[LEMA_REPLY]`) to verify that weight updates were successfully applied.
- Hardware: NVIDIA Tesla P100 (16GB VRAM)
- Batch Size: 8 (Gradient Accumulation: 1)
- Sequence Length: 512
- Steps: 625 (1 Epoch over 5k examples)
- Optimizer: AdamW (lr=1e-4)
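The step count and wall time follow directly from these numbers; a quick consistency check using only figures stated on this card:

```python
# All inputs are taken from the card: 5k examples, batch size 8, no grad accumulation.
examples = 5_000
batch_size = 8
grad_accum = 1

steps_per_epoch = examples // (batch_size * grad_accum)  # 625 steps for one epoch

# The logged average step time (32.23 s) should roughly reproduce the
# reported ~5h 40m total training time:
total_hours = steps_per_epoch * 32.23 / 3600
```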
## Memory Efficiency
| Metric | Standard PEFT/LoRA | LEMA (This Run) |
|---|---|---|
| Peak VRAM | OOM | 6.36 GB |
| System RAM | OOM | 2.40 GB |
Note: Standard PEFT typically OOMs at Batch Size 4-8 on 16GB cards with 512 context. LEMA held steady at <7GB.
## Training Logs
The training loss converged smoothly, demonstrating stable learning despite the layer-wise streaming architecture.
```
Step 10/625  | Loss: 2.1732 | VRAM: 6.36GB
Step 100/625 | Loss: 0.0677 | VRAM: 6.36GB
Step 200/625 | Loss: 0.0462 | VRAM: 6.36GB
Step 300/625 | Loss: 0.0407 | VRAM: 6.36GB
Step 400/625 | Loss: 0.0412 | VRAM: 6.36GB
Step 500/625 | Loss: 0.0459 | VRAM: 6.36GB
Step 600/625 | Loss: 0.0406 | VRAM: 6.36GB
Final Step   | Training Complete
```
### Derived Metrics
- Total Training Time: 5h 40m
- Average Step Time: 32.23s
- Peak VRAM: 6.36GB (stable)
- Peak RAM: 2.52GB
Full raw logs are available here.
## Limitations & Known Issues
⚠️ Warning: Experimental Proof-of-Concept
This model was trained for only 1 epoch as a mechanical stress test of the LEMA library. While it successfully learned the new vocabulary and special tags, it has not yet mastered the logical structure or grammar of the custom template.
- Token Looping: The model may repeat tags like `[LEMA_REPLY]` multiple times in a loop.
- Hallucinations: It may invent creative definitions for terms it hasn't seen in its original pre-training (e.g., hallucinating an acronym for LEMA).
- Overfitting: Due to the small, highly repetitive synthetic dataset and 1-epoch training, the model is likely overfit to the specific examples provided.
- Template Grammar: It often skips the `Explanation:` and `Confidence:` fields.
To reach production-grade results and make the model usable for general tasks, training for 3-5 epochs on a much larger, more diverse dataset (50k+ examples) is recommended.
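Until a longer training run lands, the tag-looping issue can be mitigated at inference time by truncating generations at the first closing tag. A small hypothetical post-processing helper (`truncate_reply` is not part of LEMA):

```python
def truncate_reply(text, stop_tag="[/LEMA_REPLY]"):
    """Cut a generation at the first closing tag to suppress tag loops."""
    idx = text.find(stop_tag)
    if idx == -1:
        return text                  # no closing tag produced; return as-is
    return text[: idx + len(stop_tag)]

# A looping generation is trimmed to the first complete reply block:
looped = "[LEMA_REPLY]\nAnswer: ...\n[/LEMA_REPLY]\n[LEMA_REPLY]\nAnswer: ..."
clean = truncate_reply(looped)
```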
## Usage
This model uses a custom prompt format for testing purposes:
```
<|system|>
You are a precise assistant trained using LEMA.

<|user|>
What is LEMA?

<|assistant|>
[LEMA_REPLY]
Answer: ...
Explanation: ...
Confidence: High
[/LEMA_REPLY]
```
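The format above can be assembled programmatically. A small hypothetical helper (this is not a chat template shipped with the tokenizer):

```python
def build_prompt(user_msg,
                 system_msg="You are a precise assistant trained using LEMA."):
    """Assemble the custom LEMA chat format used during fine-tuning.

    The prompt ends with "Answer:" so generation starts inside the
    [LEMA_REPLY] block.
    """
    return (
        f"<|system|>\n{system_msg}\n\n"
        f"<|user|>\n{user_msg}\n\n"
        f"<|assistant|>\n[LEMA_REPLY]\nAnswer:"
    )

prompt = build_prompt("What is LEMA?")
```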
### Loading with Transformers
Since this model has been merged (LoRA adapter integrated into base), you can load it as a standard Llama model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The LoRA adapter is already merged, so this loads like any Llama model.
# Move the model to the GPU so it matches the device of the inputs below.
model = AutoModelForCausalLM.from_pretrained("Pomilon/LEMA-llama-2-7b").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Pomilon/LEMA-llama-2-7b")

prompt = "<|system|>\nYou are a precise assistant trained using LEMA.\n\n<|user|>\nWhat is LEMA?\n\n<|assistant|>\n[LEMA_REPLY]\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0]))
```
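Replies in this format can be parsed back into fields. Since the Limitations section notes that `Explanation:` and `Confidence:` are often skipped, the hypothetical parser below treats every field as optional:

```python
import re

def parse_reply(text):
    """Extract Answer/Explanation/Confidence from a [LEMA_REPLY] block.

    Returns None if no [LEMA_REPLY] block is found; missing fields are
    simply absent from the result (the model often omits them).
    """
    body_match = re.search(r"\[LEMA_REPLY\](.*?)(\[/LEMA_REPLY\]|$)", text, re.S)
    if not body_match:
        return None
    body = body_match.group(1)
    fields = {}
    for name in ("Answer", "Explanation", "Confidence"):
        m = re.search(rf"{name}:[ \t]*(.*)", body)
        if m:
            fields[name] = m.group(1).strip()
    return fields

sample = ("[LEMA_REPLY]\nAnswer: A streaming fine-tuning framework.\n"
          "Confidence: High\n[/LEMA_REPLY]")
parsed = parse_reply(sample)
```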
## About LEMA
LEMA is an experimental framework designed to democratize LLM fine-tuning. It treats model weights as a stream of data rather than a static block, allowing models to be processed layer-by-layer. This trades computation time (latency) for massive memory savings.