# Project Structure

This document provides a detailed overview of the CodeLLaMA-Linux-BugFix project structure, explaining the purpose and organization of each component.
## Root Directory

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/        # Dataset creation and processing
├── dataset/                # Generated datasets and data files
├── train/                  # Model training scripts and outputs
├── evaluate/               # Model evaluation and testing
├── requirements.txt        # Python dependencies
├── README.md               # Project documentation
└── PROJECT_STRUCTURE.md    # This file
```
## Dataset Builder (`dataset_builder/`)

The dataset builder extracts bug-fix data from the Linux kernel Git repository and converts it into a training-ready format.

**Files:**

- `extract_linux_bugfixes.py` - Main dataset extraction script
  - Uses PyDriller to analyze Linux kernel Git history
  - Filters commits using bug-fix keywords
  - Extracts code context around bug locations
  - Generates structured dataset entries
- `extract_linux_bugfixes_parallel.py` - Parallelized version of the dataset builder
  - Multi-process implementation for faster processing
  - Configurable worker count (default: 16 workers)
  - Test mode with limited commit processing
- `format_for_training.py` - Format conversion script
  - Converts structured data to prompt-completion pairs
  - Formats input for supervised fine-tuning
  - Creates training-ready JSONL format
**Key Features:**

- **Commit Filtering**: Identifies bug-fix commits using 17 keywords
- **Code Context**: Extracts 10 lines before/after the bug location
- **File Filtering**: Focuses on C source and header files (`.c`, `.h`)
- **Diff Extraction**: Captures Git diff patches for fixes
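The keyword-based commit filter can be sketched as follows. The keyword list here is an illustrative subset (the actual script uses 17 keywords), and the function name is hypothetical:

```python
# Illustrative subset of bug-fix keywords; extract_linux_bugfixes.py
# uses 17 keywords in total.
BUGFIX_KEYWORDS = [
    "fix", "bug", "leak", "overflow", "null", "crash",
    "race", "deadlock", "corruption", "oops", "panic",
]

def is_bugfix_commit(message: str) -> bool:
    """Return True if a commit message mentions any bug-fix keyword."""
    lowered = message.lower()
    return any(keyword in lowered for keyword in BUGFIX_KEYWORDS)
```

In the real pipeline a predicate like this would be applied to each commit yielded by PyDriller's `Repository(...).traverse_commits()`.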
## Dataset (`dataset/`)

Contains the generated datasets used for training and evaluation.

**Files:**

- `training_data_100k.jsonl` - Main training dataset
  - 100,000 bug-fix samples
  - Structured format with input/output pairs
  - Stored using Git LFS for large file handling
- `training_data_prompt_completion.jsonl` - Converted training format
  - Prompt-completion pairs for supervised learning
  - Optimized for transformer model training
  - Stored using Git LFS
**Data Format:**

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Bug fix instruction from commit message"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```
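A minimal sketch of how `format_for_training.py` might map one structured entry to a prompt-completion pair; the prompt template and function names are assumptions, but the field names match the format above:

```python
import json

def to_prompt_completion(entry: dict) -> dict:
    """Map one structured entry to a prompt-completion pair.

    The prompt template is hypothetical; the real one lives in
    format_for_training.py.
    """
    prompt = (
        f"### Instruction:\n{entry['input']['instruction']}\n\n"
        f"### Buggy code:\n{entry['input']['original code']}\n\n"
        f"### Fix (diff):\n"
    )
    return {"prompt": prompt, "completion": entry["output"]["diff codes"]}

def convert_file(src_path: str, dst_path: str) -> None:
    """Convert a structured JSONL file line by line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            pair = to_prompt_completion(json.loads(line))
            dst.write(json.dumps(pair) + "\n")
```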
## Training (`train/`)

Contains all training-related scripts, configurations, and model outputs.

**Files:**

- `train_codellama_qlora_linux_bugfix.py` - Main training script
  - QLoRA fine-tuning implementation
  - Optimized for H200 GPU with bfloat16
  - Includes Weights & Biases integration
  - Comprehensive training configuration
- `train_codellama_qlora_simple.py` - Alternative training script
  - Simplified QLoRA implementation
  - Basic training setup without advanced features
  - Good for testing and development
- `download_codellama_model.py` - Model download utility
  - Downloads the base CodeLLaMA-7B-Instruct model
  - Ensures model availability before training

**Output Directory (`train/output/`):**

- `qlora-codellama-bugfix/` - Main model output
  - `adapter_model.safetensors` - LoRA adapter weights
  - `adapter_config.json` - LoRA configuration
  - `tokenizer.json` - Tokenizer files
  - `chat_template.jinja` - Conversation template
  - `checkpoint-500/` - Training checkpoint at step 500
  - `checkpoint-1000/` - Training checkpoint at step 1000
  - `README.md` - Model card and documentation
**Training Configuration:**

- **Base Model**: `codellama/CodeLlama-7b-Instruct-hf`
- **Method**: QLoRA with 4-bit quantization
- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Hardware**: Optimized for H200 GPU
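With the hyperparameters listed above, the QLoRA setup in the training script plausibly looks like the following sketch; the `target_modules` choice is an assumption, not taken from the script:

```python
# Sketch of a QLoRA setup matching the stated hyperparameters
# (r=64, alpha=16, dropout=0.1, 4-bit quantization, bfloat16 compute).
# Illustrative configuration only, not the exact training script.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```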
## Evaluation (`evaluate/`)

Contains evaluation scripts and results for assessing model performance.

**Files:**

- `evaluate_linux_bugfix_model.py` - Main evaluation script
  - Loads the fine-tuned model for inference
  - Generates predictions on test data
  - Computes BLEU and ROUGE metrics
  - Saves results in multiple formats
- `test_samples.jsonl` - Evaluation dataset
  - Test samples for model evaluation
  - Stored using Git LFS

**Output Directory (`evaluate/output/`):**

- `eval_results.json` - Detailed evaluation results
  - Complete predictions and references
  - Stored using Git LFS
- `eval_results.csv` - Tabular evaluation results
  - CSV format for easy analysis
  - Stored using Git LFS
**Evaluation Metrics:**

- **BLEU Score**: Measures n-gram precision overlap between generated and reference patches
- **ROUGE Score**: Measures n-gram recall overlap between generated and reference patches
- **Human Evaluation**: Qualitative assessment of fix correctness
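The evaluation script computes these scores with the `evaluate` package (see Dependencies). As a self-contained illustration of what such overlap metrics capture, here is a minimal ROUGE-1-style F1 over whitespace tokens (a simplification; real ROUGE tokenization differs):

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference patch.

    Simplified illustration only; the evaluation script uses the
    `evaluate` package's BLEU and ROUGE implementations.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped count of unigrams shared between prediction and reference
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```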
## Dependencies (`requirements.txt`)

Comprehensive list of Python packages required for the project:

**Core ML Libraries:**

- `transformers==4.53.1` - Hugging Face Transformers
- `torch==2.7.1+cu128` - PyTorch with CUDA support
- `peft==0.16.0` - Parameter-efficient fine-tuning
- `accelerate==1.8.1` - Distributed training
- `bitsandbytes==0.46.1` - Quantization support

**Data Processing:**

- `datasets==3.6.0` - Dataset handling
- `pandas==2.3.1` - Data manipulation
- `numpy==2.3.1` - Numerical computing

**Git Analysis:**

- `pydriller` - Git repository mining
- `gitpython` - Git operations

**Utilities:**

- `tqdm==4.67.1` - Progress bars
- `wandb` - Experiment tracking
- `evaluate==0.4.4` - Evaluation metrics
## Workflow

### 1. Dataset Creation

```bash
cd dataset_builder
python extract_linux_bugfixes.py   # Extract bug-fix data
python format_for_training.py      # Convert format
```

### 2. Model Training

```bash
cd train
python train_codellama_qlora_linux_bugfix.py   # Train with QLoRA
```

### 3. Model Evaluation

```bash
cd evaluate
python evaluate_linux_bugfix_model.py   # Evaluate performance
```
## Key Design Principles

### Modularity
- Each component has a specific responsibility
- Clear separation between data, training, and evaluation
- Easy to modify or extend individual components

### Efficiency
- QLoRA for memory-efficient training
- Parallel processing for dataset creation
- Optimized for modern GPU hardware

### Reproducibility
- Version-controlled dependencies
- Structured data formats
- Comprehensive logging and evaluation

### Scalability
- Configurable parameters for different hardware
- Support for distributed training
- Efficient data handling with Git LFS
## File Naming Conventions

- **Scripts**: Descriptive names with clear purpose
- **Datasets**: Include size/version information
- **Models**: Include architecture and method
- **Results**: Include timestamp or version
- **Configs**: Use `.json` or `.yaml` format
## Documentation

- **README.md**: Project overview and quick start
- **PROJECT_STRUCTURE.md**: This detailed structure guide
- **Model README**: Generated model cards in output directories
- **Code Comments**: Inline documentation in all scripts

This structure ensures the project is organized, maintainable, and easy to understand for both users and contributors.