hks350d/git-diff-to-commit-gemma-3-270m

@@ -12,19 +12,65 @@ base_model:
 A small, fast model specialized to turn a git diff into a concise, English commit message. Built on top of `google/gemma-3-270m-it` and fine-tuned with LoRA using MLX on macOS.
 ## What this model expects (most important)
 - Input type: a unified git diff as plain text.
-- Wrap the diff in a Markdown code fence labeled `diff` for best results:
-  ```
-  ```diff
-  <your unified git diff here>
-  ```
-  ```
 - The diff should look like the output of `git diff --no-color` (hunk headers like `@@`, `+`/`-` line prefixes, file headers, etc.).
 - Keep diffs reasonably sized. The training/CLI path truncates diffs to ~3,000 characters and trains/infers with a context window of ~2,048 tokens. Extremely large diffs should be summarized or sampled.
 - Language of response: English only. The system prompt enforces English output.
 ### Chat template (Gemma 3)
 The model was trained and run for inference with Gemma’s chat template. Conceptually:
@@ -52,83 +98,185 @@ Training data (chat format) examples were stored like:
 ## Quick usage
-### CLI (included in this repo)
-- From a staged diff in your current repo:
-```bash
-python commit_msg_cli.py run --from-git --staged --adapter \
-  --model google/gemma-3-270m-it \
-  --adapter-path ./adapters
-```
-- From a diff file:
-```bash
-python commit_msg_cli.py run --diff path/to/diff.txt --adapter \
-  --model google/gemma-3-270m-it \
-  --adapter-path ./adapters
-```
-The CLI will wrap your diff with the expected prompt/template and return a single-line message.
-### Programmatic (MLX)
 ```python
-from mlx_lm.utils import load as mlx_load
-from mlx_lm.generate import generate
-from chat_template_utils import get_gemma_tokenizer, format_commit_message_prompt
-from mlx_lm import sample_utils
-model_name = "google/gemma-3-270m-it"
-adapter_path = "./adapters"  # or a specific run dir
-diff_text = """diff --git a/app.py b/app.py
-index e69de29..f4c3b4a 100644
---- a/app.py
-+++ b/app.py
-@@ -0,0 +1,3 @@
-+def add(a, b):
-+    return a + b
-+"""
-# Load with adapter if available
-model, tok = mlx_load(model_name, adapter_path=adapter_path)
-# Use Gemma chat template for the prompt
-tokenizer = get_gemma_tokenizer(model_name)
-prompt = format_commit_message_prompt(diff_text, tokenizer, include_generation_prompt=True)
-sampler = sample_utils.make_sampler(temp=0.7, top_p=0.9, top_k=64)
-out = generate(model, tok, prompt=prompt, max_tokens=100, verbose=False, sampler=sampler)
-print(out)
 ```
 ## Examples
-Input (user message content):
 ```diff
-diff --git a/app.py b/app.py
-index e69de29..f4c3b4a 100644
---- a/app.py
-+++ b/app.py
-@@ -0,0 +1,3 @@
-+def add(a, b):
-+    return a + b
 +
 ```
 Possible outputs:
-- Add simple add() helper
-- Implement add function
-- Introduce add utility for two-number sum
 ## Training summary
 - Base model: `google/gemma-3-270m-it` (Gemma 3, 270M, instruction-tuned).
 - Method: LoRA fine-tuning with MLX (`mlx_lm lora`). Prompt masking was enabled, so the loss is computed only on the assistant response and not on the prompt tokens.
-- Data: Local JSONL converted to chat format with diffs fenced as ```diff and English, single-line commit messages as targets. In this repo, the dataset used (`data/train_gpt-oss-20b.jsonl`) was parsed and converted to a chat messages format. This particular set is Python-focused.
 - Context/config highlights: max sequence length ~2048 tokens; diffs truncated to ~3,000 characters during preprocessing/inference to be model-friendly.
 ## Evaluation
@@ -160,4 +308,4 @@ The repository’s `format_commit_message_prompt` builds the correct prompt for
 ## License and credits
 - Base model: Google Gemma 3 (`google/gemma-3-270m-it`). Use subject to the Gemma license terms.
-- Fine-tuning code: MLX and utilities in this repository. See repository license for details.

 A small, fast model specialized to turn a git diff into a concise, English commit message. Built on top of `google/gemma-3-270m-it` and fine-tuned with LoRA using MLX on macOS.
+## Requirements
+- macOS with Apple Silicon (for MLX)
+- Python 3.8+
+- Required packages:
+  ```bash
+  pip install mlx-lm transformers
+  ```
 ## What this model expects (most important)
 - Input type: a unified git diff as plain text.
+- Wrap the diff in a Markdown code fence labeled `diff` for best results.
 - The diff should look like the output of `git diff --no-color` (hunk headers like `@@`, `+`/`-` line prefixes, file headers, etc.).
 - Keep diffs reasonably sized. The training/CLI path truncates diffs to ~3,000 characters and trains/infers with a context window of ~2,048 tokens. Extremely large diffs should be summarized or sampled.
 - Language of response: English only. The system prompt enforces English output.
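The ~3,000-character limit above can be applied before building the prompt. A minimal sketch, assuming a plain character cutoff that backs up to the last complete line; the helper name `truncate_diff` and the line-boundary behavior are illustrative assumptions, not the repository's preprocessing code:

```python
MAX_DIFF_CHARS = 3000  # approximate limit quoted above; the real cutoff may differ


def truncate_diff(diff_text: str, max_chars: int = MAX_DIFF_CHARS) -> str:
    """Trim a diff to at most max_chars, ending on a full line when possible."""
    if len(diff_text) <= max_chars:
        return diff_text
    head = diff_text[:max_chars]
    # Prefer to end on a complete line so the last diff line is not cut in half.
    last_newline = head.rfind("\n")
    if last_newline > 0:
        head = head[:last_newline]
    return head
```

Cutting at a line boundary keeps every remaining line a valid `+`, `-`, or context line, which is closer to what the model saw during training than a line chopped mid-token.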
+### Training Data Format
+This model was trained on the `data/train_gpt-oss-20b.jsonl` dataset in this repository. The training data uses Gemma's chat template format with the following exact structure:
+**User prompt format (as seen in training data):**
+```
+Generate a concise and descriptive commit message for this git diff:
+```diff
+diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
+index <HASH>..<HASH> 100644
+--- a/src/ossos-pipeline/scripts/update_astrometry.py
++++ b/src/ossos-pipeline/scripts/update_astrometry.py
+@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
+     cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
+     cutout.zmag = new_zp
++    if math.fabs(new_zp - old_zp) > 0.3:
++        logging.warning("Large change in zeropoint detected: {}  -> {}".format(old_zp, new_zp))
++
+     try:
+-        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
++        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
+         (x, y) = cutout.get_observed_coordinates((x, y))
+     except:
+         logging.warn("Failed to do photometry.")
+```
+```
+**Important:** To get the best results, match this exact format including:
+- The instruction text: "Generate a concise and descriptive commit message for this git diff:"
+- The double newline after the instruction
+- The diff wrapped in triple backticks with `diff` language tag
+- Hash placeholders shown as `<HASH>..<HASH>` in the diff headers
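The `<HASH>..<HASH>` placeholder above suggests that blob hashes on `index` lines were masked before training. A minimal sketch of applying the same substitution to a real diff before prompting, assuming only the `index` header lines were rewritten (the helper name `mask_index_hashes` and the regex are illustrative and not taken from this repository):

```python
import re

# Matches the "index <old>..<new>" header that git prints for each file,
# e.g. "index e69de29..f4c3b4a 100644". Diff body lines always start with
# "+", "-", or a space, so they are not affected.
_INDEX_LINE = re.compile(r"^index [0-9a-f]+\.\.[0-9a-f]+", flags=re.MULTILINE)


def mask_index_hashes(diff_text: str) -> str:
    """Replace blob hashes on index lines with the <HASH>..<HASH> placeholder."""
    return _INDEX_LINE.sub("index <HASH>..<HASH>", diff_text)
```

Whether leaving real hashes in place measurably hurts output quality is not stated here; the substitution simply brings the input closer to the training examples.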
+### Chat template (Gemma 3)
+The model was trained using Gemma's chat template with the system prompt enforcing English-only responses. The conceptual structure is:
+- system: "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
+- user: The exact format shown above
+- assistant: single-line commit message (target)
 ### Chat template (Gemma 3)
 The model was trained and run for inference with Gemma’s chat template. Conceptually:
 ## Quick usage
+### Python Script (MLX)
+Here's a complete standalone script to generate commit messages using this model:
 ```python
+#!/usr/bin/env python3
+"""
+Standalone script to generate git commit messages using the fine-tuned Gemma model.
+Requires: mlx-lm, transformers
+Install with: pip install mlx-lm transformers
+"""
+import subprocess
+import sys
+from mlx_lm import load, generate
+from transformers import AutoTokenizer
+def get_staged_diff():
+    """Get the staged git diff from the current repository."""
+    try:
+        result = subprocess.run(
+            ['git', 'diff', '--staged', '--no-color'],
+            capture_output=True, text=True, check=True
+        )
+        return result.stdout.strip()
+    except subprocess.CalledProcessError:
+        print("Error: Could not get git diff. Make sure you're in a git repository with staged changes.")
+        return None
+def format_prompt(diff_text, tokenizer):
+    """Format the diff into the exact training data format."""
+    system_prompt = "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
+    user_message = f"Generate a concise and descriptive commit message for this git diff:\n\n```diff\n{diff_text}\n```"
+    # Format using Gemma chat template
+    messages = [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": user_message}
+    ]
+    prompt = tokenizer.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=True
+    )
+    return prompt
+def generate_commit_message(diff_text, model_path="your-username/git-diff-to-commit-gemma-3-270m"):
+    """Generate a commit message from a git diff."""
+    # Load model and tokenizer
+    print("Loading model...")
+    model, mlx_tokenizer = load(model_path)
+    hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
+    # Format the prompt
+    prompt = format_prompt(diff_text, hf_tokenizer)
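+    # Generate response. Recent mlx-lm releases take sampling settings through
+    # a sampler object rather than temp/top_p keyword arguments.
+    from mlx_lm.sample_utils import make_sampler
+    print("Generating commit message...")
+    response = generate(
+        model,
+        mlx_tokenizer,
+        prompt=prompt,
+        max_tokens=100,
+        sampler=make_sampler(temp=0.7, top_p=0.9),
+        verbose=False
+    )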
+    # generate() returns only the newly generated text, so there is no prompt
+    # prefix to strip; slicing by len(prompt) would discard the output.
+    generated_text = response.strip()
+    # Return the first non-empty line
+    lines = [line.strip() for line in generated_text.split('\n') if line.strip()]
+    return lines[0] if lines else "Unable to generate commit message"
+def main():
+    """Main function - can be used with staged diff or provided diff text."""
+    if len(sys.argv) > 1:
+        # Use provided diff file
+        diff_file = sys.argv[1]
+        try:
+            with open(diff_file, 'r') as f:
+                diff_text = f.read().strip()
+        except FileNotFoundError:
+            print(f"Error: File {diff_file} not found.")
+            return
+    else:
+        # Get staged diff from git
+        diff_text = get_staged_diff()
+        if not diff_text:
+            print("No staged changes found. Stage some changes with 'git add' first.")
+            return
+    if not diff_text:
+        print("No diff content to process.")
+        return
+    # Generate and print commit message
+    commit_message = generate_commit_message(diff_text)
+    print(f"\nSuggested commit message:")
+    print(f"  {commit_message}")
+if __name__ == "__main__":
+    main()
 ```
+### Usage Examples
+1. **Generate from staged git changes:**
+   ```bash
+   python generate_commit.py
+   ```
+2. **Generate from a diff file:**
+   ```bash
+   python generate_commit.py my_changes.diff
+   ```
+3. **Use in your own code:**
+   ```python
+   from generate_commit import generate_commit_message
+   diff = """diff --git a/app.py b/app.py
+   index e69de29..f4c3b4a 100644
+   --- a/app.py
+   +++ b/app.py
+   @@ -0,0 +1,3 @@
+   +def add(a, b):
+   +    return a + b
+   """
+   message = generate_commit_message(diff)
+   print(message)
+   ```
 ## Examples
+Input (user message content as formatted in training data):
+```
+Generate a concise and descriptive commit message for this git diff:
 ```diff
+diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
+index <HASH>..<HASH> 100644
+--- a/src/ossos-pipeline/scripts/update_astrometry.py
++++ b/src/ossos-pipeline/scripts/update_astrometry.py
+@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
+     cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
+     cutout.zmag = new_zp
++    if math.fabs(new_zp - old_zp) > 0.3:
++        logging.warning("Large change in zeropoint detected: {}  -> {}".format(old_zp, new_zp))
 +
+     try:
+-        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
++        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
+         (x, y) = cutout.get_observed_coordinates((x, y))
+     except:
+         logging.warn("Failed to do photometry.")
+```
 ```
 Possible outputs:
+- fix: use new_zp instead of old_zp for magnitude calculation and add zeropoint change warning
+- fix: correct zeropoint usage in photometry and add warning for large zeropoint changes
+- refactor: update magnitude calculation to use new zeropoint and add change detection
 ## Training summary
 - Base model: `google/gemma-3-270m-it` (Gemma 3, 270M, instruction-tuned).
 - Method: LoRA fine-tuning with MLX (`mlx_lm lora`). Prompt masking was enabled, so the loss is computed only on the assistant response and not on the prompt tokens.
+- **Training data**: `data/train_gpt-oss-20b.jsonl` in this repository - a dataset converted to chat format with diffs fenced as ```diff and English, single-line commit messages as targets. This dataset is Python-focused.
+- Data format: Each training example uses the exact user prompt format shown above in the chat template structure.
 - Context/config highlights: max sequence length ~2048 tokens; diffs truncated to ~3,000 characters during preprocessing/inference to be model-friendly.
+- **Important**: To achieve best results, match the exact input format used in the training data.
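Prompt masking, mentioned in the training summary above, means the training loss is averaged over assistant tokens only. A conceptual sketch in plain Python, assuming per-token losses are already computed; `response_only_loss` is a hypothetical helper for illustration and is not how `mlx_lm` implements it internally:

```python
def response_only_loss(per_token_losses, prompt_length):
    """Average per-token losses over the response positions only.

    per_token_losses: one loss value per token of prompt + response.
    prompt_length: number of leading tokens that belong to the prompt.
    """
    response_losses = per_token_losses[prompt_length:]
    if not response_losses:
        raise ValueError("example has no response tokens")
    return sum(response_losses) / len(response_losses)


# Two prompt tokens (ignored) followed by two response tokens (averaged).
print(response_only_loss([5.0, 5.0, 1.0, 3.0], prompt_length=2))
```

With this kind of masking the diff text in the prompt does not contribute to the gradient, so the adapter capacity goes toward producing the commit message and not toward reproducing diffs.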
 ## Evaluation
 ## License and credits
 - Base model: Google Gemma 3 (`google/gemma-3-270m-it`). Use subject to the Gemma license terms.
+- Fine-tuning code: MLX and utilities in this repository. See repository license for details.