# Project Structure

This document provides a detailed overview of the CodeLLaMA-Linux-BugFix project structure, explaining the purpose and organization of each component.
## Root Directory

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/       # Dataset creation and processing
├── dataset/               # Generated datasets and data files
├── train/                 # Model training scripts and outputs
├── evaluate/              # Model evaluation and testing
├── requirements.txt       # Python dependencies
├── README.md              # Project documentation
└── PROJECT_STRUCTURE.md   # This file
```
## Dataset Builder (`dataset_builder/`)

The dataset builder extracts bug-fix data from the Linux kernel Git repository and converts it into a training-ready format.

### Files:
- **`extract_linux_bugfixes.py`** - Main dataset extraction script
  - Uses PyDriller to analyze Linux kernel Git history
  - Filters commits using bug-fix keywords
  - Extracts code context around bug locations
  - Generates structured dataset entries (see the sketch after the Key Features list below)
- **`extract_linux_bugfixes_parallel.py`** - Parallelized version of the dataset builder
  - Multi-process implementation for faster processing
  - Configurable worker count (default: 16 workers)
  - Test mode with limited commit processing
- **`format_for_training.py`** - Format conversion script
  - Converts structured data to prompt-completion pairs (see the conversion sketch in the Dataset section below)
  - Formats input for supervised fine-tuning
  - Creates the training-ready JSONL file
### Key Features:
- **Commit Filtering**: Identifies bug-fix commits using 17 keywords
- **Code Context**: Extracts 10 lines before/after bug location
- **File Filtering**: Focuses on C and header files (`.c`, `.h`)
- **Diff Extraction**: Captures Git diff patches for fixes
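
The core of the extraction loop looks roughly like the sketch below. It is a minimal sketch, assuming a local kernel checkout at `path/to/linux`; the keyword set is an illustrative subset of the 17 the script uses, and the real script additionally trims the code to the ±10-line context window described above.

```python
# Minimal sketch of the extraction loop (illustrative, not the exact script).
import json
from pydriller import Repository

BUGFIX_KEYWORDS = {"fix", "bug", "leak", "overflow", "null", "race"}  # subset of the 17

def looks_like_bugfix(message: str) -> bool:
    """Keyword-based commit filter."""
    message = message.lower()
    return any(keyword in message for keyword in BUGFIX_KEYWORDS)

with open("linux_bugfixes.jsonl", "w") as out:
    for commit in Repository("path/to/linux").traverse_commits():
        if not looks_like_bugfix(commit.msg):
            continue
        for mod in commit.modified_files:
            # Keep only C sources and headers.
            if not mod.filename.endswith((".c", ".h")):
                continue
            if mod.source_code_before is None or not mod.diff:
                continue
            entry = {
                "file": mod.filename,
                "instruction": commit.msg.splitlines()[0],
                "original_code": mod.source_code_before,  # trimmed to +/-10 lines in the real script
                "diff": mod.diff,
            }
            out.write(json.dumps(entry) + "\n")
```
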
## Dataset (`dataset/`)

Contains the generated datasets used for training and evaluation.

### Files:
- **`training_data_100k.jsonl`** - Main training dataset
  - 100,000 bug-fix samples
  - Structured format with input/output pairs
  - Stored using Git LFS for large file handling
- **`training_data_prompt_completion.jsonl`** - Converted training format
  - Prompt-completion pairs for supervised learning
  - Optimized for transformer model training
  - Stored using Git LFS
### Data Format:
```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Bug fix instruction from commit message"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```
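
`format_for_training.py` turns records of this shape into prompt-completion pairs. A minimal sketch of that conversion, keyed on the field names shown above; the prompt template itself is an assumption, not the project's exact wording:

```python
# Sketch of the structured-record -> prompt/completion conversion.
import json

# Illustrative template; the script's actual prompt wording may differ.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Buggy code:\n{code}\n\n"
    "### Fix (git diff):\n"
)

with open("training_data_100k.jsonl") as src, \
        open("training_data_prompt_completion.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        pair = {
            "prompt": PROMPT_TEMPLATE.format(
                instruction=record["input"]["instruction"],
                code=record["input"]["original code"],
            ),
            "completion": record["output"]["diff codes"],
        }
        dst.write(json.dumps(pair) + "\n")
```
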
## Training (`train/`)

Contains all training-related scripts, configurations, and model outputs.

### Files:
- **`train_codellama_qlora_linux_bugfix.py`** - Main training script
  - QLoRA fine-tuning implementation (see the configuration sketch below)
  - Optimized for an H200 GPU with bfloat16
  - Includes Weights & Biases integration
  - Comprehensive training configuration
- **`train_codellama_qlora_simple.py`** - Alternative training script
  - Simplified QLoRA implementation
  - Basic training setup without advanced features
  - Good for testing and development
- **`download_codellama_model.py`** - Model download utility (sketched below)
  - Downloads the base CodeLLaMA-7B-Instruct model
  - Ensures model availability before training
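
The download step amounts to a `huggingface_hub` call along these lines; the target directory is an illustrative choice, not necessarily the one the script uses:

```python
# Fetch the base model ahead of training (illustrative local path).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="codellama/CodeLlama-7b-Instruct-hf",
    local_dir="models/CodeLlama-7b-Instruct-hf",
)
```
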
### Output Directory (`train/output/`):
- **`qlora-codellama-bugfix/`** - Main model output
  - **`adapter_model.safetensors`** - LoRA adapter weights
  - **`adapter_config.json`** - LoRA configuration
  - **`tokenizer.json`** - Tokenizer files
  - **`chat_template.jinja`** - Conversation template
  - **`checkpoint-500/`** - Training checkpoint at step 500
  - **`checkpoint-1000/`** - Training checkpoint at step 1000
  - **`README.md`** - Model card and documentation
### Training Configuration:
- **Base Model**: `codellama/CodeLlama-7b-Instruct-hf`
- **Method**: QLoRA with 4-bit quantization
- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Hardware**: Optimized for H200 GPU
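
Wired together with `transformers`, `bitsandbytes`, and `peft`, this configuration corresponds roughly to the sketch below; the `target_modules` list is an assumption, since the script's exact choice isn't documented here.

```python
# Sketch of the QLoRA setup implied by the configuration above.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute for the H200
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
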
## Evaluation (`evaluate/`)

Contains evaluation scripts and results for assessing model performance.

### Files:
- **`evaluate_linux_bugfix_model.py`** - Main evaluation script
  - Loads the fine-tuned model for inference (see the loading sketch below)
  - Generates predictions on test data
  - Computes BLEU and ROUGE metrics
  - Saves results in multiple formats
- **`test_samples.jsonl`** - Evaluation dataset
  - Test samples for model evaluation
  - Stored using Git LFS
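
Loading the adapter for inference plausibly looks like the following sketch; the prompt format and generation settings are illustrative.

```python
# Sketch: attach the trained LoRA adapter to the base model and generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

ADAPTER_DIR = "train/output/qlora-codellama-bugfix"

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)

prompt = "### Instruction:\nFix the missing NULL check\n\n### Buggy code:\n...\n\n### Fix (git diff):\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
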
### Output Directory (`evaluate/output/`):
- **`eval_results.json`** - Detailed evaluation results
  - Complete predictions and references
  - Stored using Git LFS
- **`eval_results.csv`** - Tabular evaluation results
  - CSV format for easy analysis
  - Stored using Git LFS
### Evaluation Metrics:
- **BLEU Score**: Measures n-gram overlap between generated and reference patches
- **ROUGE Score**: Measures recall-oriented overlap with the reference fixes
- **Human Evaluation**: Qualitative assessment of generated patches
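
With the `evaluate` package from the dependency list, the scoring step reduces to something like this (the sample prediction/reference pair is made up for illustration):

```python
# Sketch of BLEU/ROUGE scoring over predicted vs. reference diffs.
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["+ if (!ptr)\n+     return -EINVAL;"]   # model output (made up)
references = ["+ if (!ptr)\n+     return -EINVAL;"]    # ground-truth diff (made up)

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```
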
## Dependencies (`requirements.txt`)

Comprehensive list of Python packages required for the project:

### Core ML Libraries:
- `transformers==4.53.1` - Hugging Face transformers
- `torch==2.7.1+cu128` - PyTorch with CUDA support
- `peft==0.16.0` - Parameter-efficient fine-tuning
- `accelerate==1.8.1` - Distributed training
- `bitsandbytes==0.46.1` - Quantization support

### Data Processing:
- `datasets==3.6.0` - Dataset handling
- `pandas==2.3.1` - Data manipulation
- `numpy==2.3.1` - Numerical computing

### Git Analysis:
- `pydriller` - Git repository mining
- `gitpython` - Git operations

### Utilities:
- `tqdm==4.67.1` - Progress bars
- `wandb` - Experiment tracking
- `evaluate==0.4.4` - Evaluation metrics
## Workflow

### 1. Dataset Creation
```bash
cd dataset_builder
python extract_linux_bugfixes.py   # Extract bug-fix data
python format_for_training.py      # Convert format
```

### 2. Model Training
```bash
cd train
python train_codellama_qlora_linux_bugfix.py   # Train with QLoRA
```

### 3. Model Evaluation
```bash
cd evaluate
python evaluate_linux_bugfix_model.py   # Evaluate performance
```
## Key Design Principles

### Modularity
- Each component has a specific responsibility
- Clear separation between data, training, and evaluation
- Easy to modify or extend individual components

### Efficiency
- QLoRA for memory-efficient training
- Parallel processing for dataset creation
- Optimized for modern GPU hardware

### Reproducibility
- Version-controlled dependencies
- Structured data formats
- Comprehensive logging and evaluation

### Scalability
- Configurable parameters for different hardware
- Support for distributed training
- Efficient data handling with Git LFS
## File Naming Conventions

- **Scripts**: Descriptive names with clear purpose
- **Datasets**: Include size/version information
- **Models**: Include architecture and method
- **Results**: Include timestamp or version
- **Configs**: Use `.json` or `.yaml` format
## Documentation

- **README.md**: Project overview and quick start
- **PROJECT_STRUCTURE.md**: This detailed structure guide
- **Model README**: Generated model cards in output directories
- **Code Comments**: Inline documentation in all scripts

This structure ensures the project is organized, maintainable, and easy to understand for both users and contributors.