# LEMA-Llama-2-7b (Proof of Concept)
This model is a demonstration of the LEMA (Layer-wise Efficient Memory Abstraction) framework. It proves that large language models (7B+) can be fine-tuned on consumer-grade hardware with limited VRAM (e.g., 16GB Tesla P100) by virtualizing GPU memory.
Key Achievement: Fine-tuned Llama-2-7B using only 6.36 GB of VRAM (standard LoRA typically requires ~14GB+ for this configuration).
Training code is available in the GitHub repository: LEMA-llama
## Model Details
- Base Model: NousResearch/Llama-2-7b-hf
- Framework: LEMA v1.0
- Fine-Tuning Method: LoRA (Rank 16, Alpha 32)
- Memory Strategy: Streaming (Triple-Buffer: Disk -> RAM -> VRAM)
- Precision: FP16
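The streaming strategy can be sketched as a prefetch pipeline in which only one layer's weights occupy VRAM at a time. Below is a minimal pure-Python simulation of the idea, not the actual LEMA implementation (the function name and buffer sizes are illustrative):

```python
from collections import deque

def stream_layers(layer_ids, prefetch=1):
    """Toy model of LEMA-style triple buffering: Disk -> RAM -> VRAM.

    At most `prefetch` layers wait in the RAM buffer and a single layer
    occupies "VRAM" at a time; all remaining layers stay on "disk".
    """
    disk = iter(layer_ids)          # layers not yet loaded
    ram = deque()                   # RAM staging buffer
    computed = []                   # order in which layers reach "VRAM"

    for _ in range(prefetch):       # pre-fill the RAM buffer from disk
        try:
            ram.append(next(disk))
        except StopIteration:
            break

    while ram:
        layer = ram.popleft()       # RAM -> VRAM
        computed.append(layer)      # compute on this layer
        try:
            ram.append(next(disk))  # Disk -> RAM (prefetch the next layer)
        except StopIteration:
            pass
    return computed

# All 32 Llama-2-7B decoder layers are visited in order, one at a time:
order = stream_layers(range(32))
```

Because each layer is evicted as soon as its work is done, peak residency is bounded by the buffer sizes rather than the full model, which is what keeps VRAM flat at ~6.4 GB.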
## Training Configuration
The model was trained to learn a strict custom chat format (`[LEMA_REPLY]`) to verify that weight updates were successfully applied.
- Hardware: NVIDIA Tesla P100 (16GB VRAM)
- Batch Size: 8 (Gradient Accumulation: 1)
- Sequence Length: 512
- Steps: 625 (1 Epoch over 5k examples)
- Optimizer: AdamW (lr=1e-4)
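The step count and wall time follow directly from these numbers; a quick consistency check using only figures stated on this card:

```python
# All inputs are taken from the card: 5k examples, batch size 8, no grad accumulation.
examples = 5_000
batch_size = 8
grad_accum = 1

steps_per_epoch = examples // (batch_size * grad_accum)  # 625 steps for one epoch

# The logged average step time (32.23 s) should roughly reproduce the
# reported ~5h 40m total training time:
total_hours = steps_per_epoch * 32.23 / 3600
```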
## Memory Efficiency
| Metric | Standard PEFT/LoRA | LEMA (This Run) |
|---|---|---|
| Peak VRAM | OOM | 6.36 GB |
| System RAM | OOM | 2.40 GB |
Note: Standard PEFT typically OOMs at Batch Size 4-8 on 16GB cards with 512 context. LEMA held steady at <7GB.
## Training Logs
The training loss converged smoothly, demonstrating stable learning despite the layer-wise streaming architecture.
```
Step 10/625  | Loss: 2.1732 | VRAM: 6.36GB
Step 100/625 | Loss: 0.0677 | VRAM: 6.36GB
Step 200/625 | Loss: 0.0462 | VRAM: 6.36GB
Step 300/625 | Loss: 0.0407 | VRAM: 6.36GB
Step 400/625 | Loss: 0.0412 | VRAM: 6.36GB
Step 500/625 | Loss: 0.0459 | VRAM: 6.36GB
Step 600/625 | Loss: 0.0406 | VRAM: 6.36GB
Final Step   | Training Complete
```
### Derived Metrics
- Total Training Time: 5h 40m
- Average Step Time: 32.23s
- Peak VRAM: 6.36GB (stable)
- Peak RAM: 2.52GB
Full raw logs are available here.
## Limitations & Known Issues
⚠️ Warning: Experimental Proof-of-Concept
This model was trained for only 1 epoch as a mechanical stress test of the LEMA library. While it successfully learned the new vocabulary and special tags, it has not yet mastered the logical structure or grammar of the custom template.
- Token Looping: The model may repeat tags like `[LEMA_REPLY]` multiple times in a loop.
- Hallucinations: It may invent creative definitions for terms it hasn't seen in its original pre-training (e.g., hallucinating an acronym for LEMA).
- Overfitting: Due to the small, highly repetitive synthetic dataset and 1-epoch training, the model is likely overfit to the specific examples provided.
- Template Grammar: It often skips the `Explanation:` and `Confidence:` fields.
To reach production-grade results and make the model usable for general tasks, training for 3-5 epochs on a much larger, more diverse dataset (50k+ examples) is recommended.
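Until a longer training run lands, the tag-looping issue can be mitigated at inference time by truncating generations at the first closing tag. A small hypothetical post-processing helper (`truncate_reply` is not part of LEMA):

```python
def truncate_reply(text, stop_tag="[/LEMA_REPLY]"):
    """Cut a generation at the first closing tag to suppress tag loops."""
    idx = text.find(stop_tag)
    if idx == -1:
        return text                  # no closing tag produced; return as-is
    return text[: idx + len(stop_tag)]

# A looping generation is trimmed to the first complete reply block:
looped = "[LEMA_REPLY]\nAnswer: ...\n[/LEMA_REPLY]\n[LEMA_REPLY]\nAnswer: ..."
clean = truncate_reply(looped)
```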
## Usage
This model uses a custom prompt format for testing purposes:
```
<|system|>
You are a precise assistant trained using LEMA.

<|user|>
What is LEMA?

<|assistant|>
[LEMA_REPLY]
Answer: ...
Explanation: ...
Confidence: High
[/LEMA_REPLY]
```
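The format above can be assembled programmatically. A small hypothetical helper (this is not a chat template shipped with the tokenizer):

```python
def build_prompt(user_msg,
                 system_msg="You are a precise assistant trained using LEMA."):
    """Assemble the custom LEMA chat format used during fine-tuning.

    The prompt ends with "Answer:" so generation starts inside the
    [LEMA_REPLY] block.
    """
    return (
        f"<|system|>\n{system_msg}\n\n"
        f"<|user|>\n{user_msg}\n\n"
        f"<|assistant|>\n[LEMA_REPLY]\nAnswer:"
    )

prompt = build_prompt("What is LEMA?")
```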
### Loading with Transformers
Since this model has been merged (LoRA adapter integrated into base), you can load it as a standard Llama model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The LoRA adapter is already merged, so this loads like any Llama model.
# Move the model to the GPU so it matches the device of the inputs below.
model = AutoModelForCausalLM.from_pretrained("Pomilon/LEMA-llama-2-7b").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Pomilon/LEMA-llama-2-7b")

prompt = "<|system|>\nYou are a precise assistant trained using LEMA.\n\n<|user|>\nWhat is LEMA?\n\n<|assistant|>\n[LEMA_REPLY]\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0]))
```
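Replies in this format can be parsed back into fields. Since the Limitations section notes that `Explanation:` and `Confidence:` are often skipped, the hypothetical parser below treats every field as optional:

```python
import re

def parse_reply(text):
    """Extract Answer/Explanation/Confidence from a [LEMA_REPLY] block.

    Returns None if no [LEMA_REPLY] block is found; missing fields are
    simply absent from the result (the model often omits them).
    """
    body_match = re.search(r"\[LEMA_REPLY\](.*?)(\[/LEMA_REPLY\]|$)", text, re.S)
    if not body_match:
        return None
    body = body_match.group(1)
    fields = {}
    for name in ("Answer", "Explanation", "Confidence"):
        m = re.search(rf"{name}:[ \t]*(.*)", body)
        if m:
            fields[name] = m.group(1).strip()
    return fields

sample = ("[LEMA_REPLY]\nAnswer: A streaming fine-tuning framework.\n"
          "Confidence: High\n[/LEMA_REPLY]")
parsed = parse_reply(sample)
```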
## About LEMA
LEMA is an experimental framework designed to democratize LLM fine-tuning. It treats model weights as a stream of data rather than a static block, allowing models to be processed layer-by-layer. This trades computation time (latency) for massive memory savings.