codelion committed on
Commit
575face
·
verified ·
1 Parent(s): 40643a7

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +109 -199
  2. config.json +8 -8
  3. model.safetensors +2 -2
  4. modeling_dhara.py +335 -136
  5. tokenizer.json +2 -16
README.md CHANGED
@@ -2,253 +2,163 @@
2
  license: apache-2.0
3
  language:
4
  - en
 
5
  tags:
6
- - text-generation
7
  - diffusion
8
- - language-model
9
- - causal-lm
 
 
 
10
  datasets:
11
- - HuggingFaceFW/fineweb-edu
12
- - allenai/dolma
13
- - mlfoundations/dclm-baseline-1.0
14
- model-index:
15
- - name: dhara-70m
16
- results:
17
- - task:
18
- type: text-generation
19
- dataset:
20
- name: HellaSwag
21
- type: hellaswag
22
- metrics:
23
- - name: Accuracy
24
- type: accuracy
25
- value: 25.58
26
- - task:
27
- type: text-generation
28
- dataset:
29
- name: PIQA
30
- type: piqa
31
- metrics:
32
- - name: Accuracy
33
- type: accuracy
34
- value: 51.58
35
- - task:
36
- type: text-generation
37
- dataset:
38
- name: WinoGrande
39
- type: winogrande
40
- metrics:
41
- - name: Accuracy
42
- type: accuracy
43
- value: 49.64
44
- - task:
45
- type: text-generation
46
- dataset:
47
- name: ARC-Challenge
48
- type: arc_challenge
49
- metrics:
50
- - name: Accuracy
51
- type: accuracy
52
- value: 24.83
53
- - task:
54
- type: text-generation
55
- dataset:
56
- name: MMLU
57
- type: mmlu
58
- metrics:
59
- - name: Accuracy
60
- type: accuracy
61
- value: 23.85
62
- - task:
63
- type: text-generation
64
- dataset:
65
- name: TruthfulQA
66
- type: truthfulqa_mc2
67
- metrics:
68
- - name: Accuracy
69
- type: accuracy
70
- value: 47.50
71
  ---
72
 
73
- # Dhara-70M
74
 
75
- A 70M parameter diffusion language model optimized for high-throughput text generation with superior factuality.
76
-
77
- ## Table of Contents
78
- - [Model Description](#model-description)
79
- - [Training Data](#training-data)
80
- - [Training Details](#training-details)
81
- - [Benchmark Results](#benchmark-results)
82
- - [Usage](#usage)
83
- - [Key Insights](#key-insights)
84
- - [Limitations](#limitations)
85
- - [Citation](#citation)
86
 
87
  ## Model Description
88
 
89
- Dhara-70M is a novel diffusion language model that achieves:
90
- - **3.8x higher throughput** than autoregressive models
91
- - **Best-in-class factuality** on TruthfulQA (47.50%)
92
- - **10x training efficiency** via WSD (Warmup-Stable-Decay) conversion
93
 
94
- ### Architecture
95
 
96
- | Specification | Value |
97
- |--------------|-------|
98
- | **Parameters** | 71.34M |
99
- | **Layers** | 32 |
100
- | **Hidden Size** | 384 |
101
- | **FF Dimension** | 1024 |
102
- | **Attention Heads** | 8 |
103
- | **KV Heads** | 4 (GQA) |
104
- | **Context Length** | 2048 tokens |
105
- | **Position Encoding** | RoPE |
106
- | **Normalization** | RMSNorm |
107
- | **Special Layers** | Canon (depthwise causal convolutions) |
108
- | **Generation Type** | Diffusion (parallel token generation) |
109
-
110
- ## Training Data
111
-
112
- Dhara was trained in two stages:
113
-
114
- **Stage 1: AR Pretraining (1B tokens)**
115
- - 40% FinePDFs (400M tokens)
116
- - 30% DCLM Baseline (300M tokens)
117
- - 30% FineWeb-Edu (300M tokens)
118
-
119
- **Stage 2: WSD Conversion (100M tokens)**
120
- - Progressive block size warmup (1→4→32→64→1024)
121
- - MDLM diffusion objective
122
 
123
- ## Training Details
124
 
125
  | Parameter | Value |
126
  |-----------|-------|
127
- | **AR Training Tokens** | 1 billion |
128
- | **WSD Conversion Tokens** | 100 million |
129
- | **Batch Size** | 128 effective (8 × 16 gradient accumulation) |
130
- | **Learning Rate** | 5e-4 (AR) / 5e-5 (WSD) |
131
- | **Optimizer** | AdamW |
132
- | **Schedule** | Cosine decay with 2% warmup |
133
- | **Precision** | BF16 |
134
- | **Hardware** | Single NVIDIA A40 GPU |
135
- | **Total Training Time** | ~20 hours |
136
-
137
- ## Benchmark Results
138
-
139
- | Benchmark | Dhara-70M | GPT-2-70M | vs GPT-2 |
140
- |-----------|-----------|-----------|----------|
141
- | HellaSwag (0-shot) | 25.58% | 26.46% | -0.88% |
142
- | PIQA (0-shot) | 51.58% | 58.05% | -6.47% |
143
- | WinoGrande (0-shot) | 49.64% | 52.64% | -3.00% |
144
- | ARC-Challenge (0-shot) | **24.83%** | 22.27% | **+2.56%** |
145
- | MMLU (5-shot) | 23.85% | 25.77% | -1.92% |
146
- | TruthfulQA (0-shot) | **47.50%** | 45.83% | **+1.67%** |
147
- | GSM8K (5-shot) | 0.00% | 1.21% | -1.21% |
148
- | **Average** | **31.85%** | **33.18%** | -1.33% |
149
-
150
- ### Inference Performance
151
-
152
- | Metric | Dhara-70M | GPT-2-70M | Advantage |
153
- |--------|-----------|-----------|-----------|
154
- | Time to First Token | 35.5 ms | ~25 ms | 1.4x slower |
155
- | Throughput | 183.5 tok/s | ~48 tok/s | **3.8x faster** |
156
- | Peak Memory | 0.24 GB | 0.15 GB | 1.6x higher |
157
 
158
  ## Usage
159
 
 
 
 
 
 
 
 
 
160
  ```python
 
161
  from transformers import AutoTokenizer, AutoModelForCausalLM
162
 
163
  # Load model and tokenizer
 
 
 
 
 
164
  tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-70m")
165
- model = AutoModelForCausalLM.from_pretrained("codelion/dhara-70m", trust_remote_code=True)
166
 
167
- # Generate text using diffusion sampling
168
- inputs = tokenizer("The future of AI is", return_tensors="pt")
 
 
 
 
 
 
169
  outputs = model.generate(
170
- **inputs,
171
- max_new_tokens=40, # Generate 40 new tokens
172
- num_diffusion_steps=10, # Diffusion denoising steps (higher = better quality)
173
- do_sample=True,
174
  temperature=0.8,
175
- top_p=0.9
 
 
 
176
  )
 
177
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
178
  ```
179
 
180
- ### Batch Generation (High Throughput)
181
-
182
- ```python
183
- # For batch generation, use larger batch sizes
184
- prompts = [
185
- "The future of AI is",
186
- "In recent years, machine learning has",
187
- "The most important discovery in physics was",
188
- "Climate change affects our planet by"
189
- ]
190
-
191
- inputs = tokenizer(prompts, return_tensors="pt", padding=True)
192
- outputs = model.generate(
193
- **inputs,
194
- max_length=100,
195
- do_sample=True,
196
- temperature=0.7,
197
- num_diffusion_steps=10 # Fewer steps = faster generation
198
- )
199
 
200
- for i, output in enumerate(outputs):
201
- print(f"Prompt {i+1}: {tokenizer.decode(output, skip_special_tokens=True)}")
202
- ```
 
 
 
 
 
 
203
 
204
- ## Key Insights
205
 
206
- 1. **Throughput vs Accuracy Trade-off**: Dhara trades 1.33% average accuracy for 3.8x higher throughput, making it ideal for batch processing tasks.
207
 
208
- 2. **Superior Factuality**: Dhara excels on TruthfulQA (+1.67% vs GPT-2), suggesting diffusion models may reduce hallucinations through bidirectional context.
209
 
210
- 3. **Reasoning Advantage**: ARC-Challenge +2.56% indicates strong performance on reasoning tasks.
211
 
212
- 4. **WSD Efficiency**: Converting an AR model to diffusion via WSD uses 10x fewer tokens than training from scratch with equivalent quality.
213
 
214
- 5. **Canon Layers Help**: The depthwise causal convolutions (Canon layers) improve factuality and reasoning with only 0.13% parameter overhead.
 
 
 
 
215
 
216
- ## When to Use Dhara
217
 
218
- **Choose Dhara when:**
219
- - Batch generation throughput matters
220
- - Factual accuracy is critical
221
- - You have an existing AR checkpoint to convert
222
 
223
- **Choose AR models when:**
224
- - Interactive latency is critical
225
- - Sequential reasoning is important (math, coding)
226
- - Memory is constrained
 
227
 
228
  ## Limitations
229
 
230
- - Lower performance on sequential reasoning tasks (GSM8K: 0.00%)
231
- - Higher memory usage due to bidirectional attention
232
- - Slightly higher time-to-first-token latency
233
- - Best suited for batch rather than interactive use cases
234
 
235
  ## Citation
236
 
 
 
237
  ```bibtex
238
- @article{sharma2025optimal,
239
- title={The Optimal Architecture for Small Language Models},
240
- author={Sharma, Asankhaya},
241
- year={2025},
242
- url={https://huggingface.co/blog/codelion/optimal-model-architecture}
 
243
  }
244
  ```
245
 
246
- ## Related Work
247
-
248
- - [The Optimal Architecture for Small Language Models](https://huggingface.co/blog/codelion/optimal-model-architecture) - Blog post describing this work
249
- - [The 1 Billion Token Challenge: Optimal Dataset Mixing](https://huggingface.co/blog/codelion/optimal-dataset-mixing) - Our previous work on optimal pretraining data
250
- - [GPT-2-70M](https://huggingface.co/codelion/gpt-2-70m) - Our previous model from optimal pretraining experiments
251
-
252
- ## Contact
253
 
254
- For questions or feedback, please open a discussion on the [Hugging Face discussions page](https://huggingface.co/codelion/dhara-70m/discussions).
 
2
  license: apache-2.0
3
  language:
4
  - en
5
+ library_name: transformers
6
  tags:
 
7
  - diffusion
8
+ - masked-language-model
9
+ - text-generation
10
+ - pytorch
11
+ - transformers
12
+ pipeline_tag: text-generation
13
  datasets:
14
+ - codelion/pre-training-dataset-samples
 
15
  ---
16
 
17
+ # Dhara-70M: Diffusion Language Model
18
 
19
+ Dhara is a 70M parameter diffusion language model that combines masked diffusion training with Canon layers for improved local context understanding.
 
 
 
 
 
 
 
 
 
 
20
 
21
  ## Model Description
22
 
23
+ Dhara was created by converting a pre-trained autoregressive (AR) LLM to a diffusion model using **Warmup-Stable-Decay (WSD)** training. This approach preserves the language understanding capabilities of the original AR model while enabling bidirectional attention and parallel token generation.
 
 
 
24
 
25
+ ### Key Features
26
 
27
+ - **Bidirectional Attention**: Unlike causal LLMs, Dhara uses full bidirectional attention during generation
28
+ - **Canon Layers**: Incorporates causal depthwise convolutions at positions A (before attention) and C (before MLP) for local context mixing
29
+ - **WSD Conversion**: Trained with 100M tokens to convert the AR checkpoint to a diffusion model while preserving its language capabilities
30
+ - **Custom Generate Method**: Includes a specialized `generate()` method that iteratively denoises masked positions into text
 
31
 
32
+ ### Architecture
33
 
34
  | Parameter | Value |
35
  |-----------|-------|
36
+ | Parameters | 70M |
37
+ | Hidden Size | 384 |
38
+ | Layers | 32 |
39
+ | Attention Heads | 8 |
40
+ | KV Heads | 4 (GQA) |
41
+ | Intermediate Size | 1024 |
42
+ | Vocabulary | 50,304 |
43
+ | Context Length | 1024 |
44
+ | Canon Kernel | 4 |
45
+ | Canon Positions | A, C |
46
+
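+ To confirm these values against the published `config.json`, you can load the custom configuration directly. A minimal check (assuming `trust_remote_code=True` is acceptable in your environment; field names follow `config.json` in this repository):
+
+ ```python
+ from transformers import AutoConfig
+
+ # Loads DharaConfig via the auto_map entry in config.json
+ config = AutoConfig.from_pretrained("codelion/dhara-70m", trust_remote_code=True)
+
+ print(config.hidden_size)               # 384
+ print(config.num_hidden_layers)         # 32
+ print(config.num_attention_heads)       # 8
+ print(config.num_key_value_heads)       # 4 (GQA)
+ print(config.intermediate_size)         # 1024
+ print(config.vocab_size)                # 50304
+ print(config.max_position_embeddings)   # 1024
+ print(config.canon_kernel, config.canon_set)  # 4 "AC"
+ ```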
47
+ ## Evaluation Results
48
+
49
+ | Benchmark | Score |
50
+ |-----------|-------|
51
+ | HellaSwag | 29.42 |
52
+ | ARC-Easy | 43.35 |
53
+ | ARC-Challenge | 24.15 |
54
+ | PIQA | 61.48 |
55
+ | Winogrande | 50.75 |
56
+ | OpenBookQA | 19.75 |
57
+ | **Average** | **38.15** |
58
+
59
+ *Self-reported evaluation results across 6 standard benchmarks.*
 
 
 
 
 
 
60
 
61
  ## Usage
62
 
63
+ ### Installation
64
+
65
+ ```bash
66
+ pip install transformers torch
67
+ ```
68
+
69
+ ### Quick Start
70
+
71
  ```python
72
+ import torch
73
  from transformers import AutoTokenizer, AutoModelForCausalLM
74
 
75
  # Load model and tokenizer
76
+ model = AutoModelForCausalLM.from_pretrained(
77
+ "codelion/dhara-70m",
78
+ trust_remote_code=True,
79
+ torch_dtype=torch.bfloat16
80
+ )
81
  tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-70m")
 
82
 
83
+ # Move to GPU if available
84
+ device = "cuda" if torch.cuda.is_available() else "cpu"
85
+ model = model.to(device)
86
+
87
+ # Generate text
88
+ prompt = "The future of artificial intelligence"
89
+ inputs = tokenizer(prompt, return_tensors="pt").to(device)
90
+
91
  outputs = model.generate(
92
+ inputs.input_ids,
93
+ max_new_tokens=50,
 
 
94
  temperature=0.8,
95
+ top_p=0.9,
96
+ top_k=50,
97
+ repetition_penalty=1.2,
98
+ do_sample=True
99
  )
100
+
101
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
102
  ```
103
 
104
+ ### Generation Parameters
 
105
 
106
+ | Parameter | Default | Description |
107
+ |-----------|---------|-------------|
108
+ | `max_new_tokens` | 50 | Number of tokens to generate |
109
+ | `temperature` | 1.0 | Sampling temperature (higher = more random) |
110
+ | `top_p` | 0.9 | Nucleus sampling threshold |
111
+ | `top_k` | 50 | Top-k sampling threshold |
112
+ | `repetition_penalty` | 1.2 | Penalty for token repetition |
113
+ | `do_sample` | True | Whether to sample or use greedy decoding |
114
+ | `num_diffusion_steps` | 10 | Diffusion refinement steps (for future use) |
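+
+ As a rough illustration, reusing `model`, `tokenizer`, and `inputs` from the Quick Start above (and assuming the custom `generate()` accepts the parameters listed in this table), greedy decoding with more denoising refinement might look like:
+
+ ```python
+ outputs = model.generate(
+     inputs.input_ids,
+     max_new_tokens=64,
+     num_diffusion_steps=20,  # more refinement iterations: slower, potentially higher quality
+     do_sample=False,         # greedy decoding instead of sampling
+ )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```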
115
 
116
+ ## Training Details
117
 
118
+ ### Dataset
119
 
120
+ Dhara was trained on a curated 1B token sample from the [Pre-training Dataset Samples](https://huggingface.co/collections/codelion/pre-training-dataset-samples) collection.
121
 
122
+ ### WSD (Warmup-Stable-Decay) Conversion
123
 
124
+ Dhara was converted from an autoregressive checkpoint using the WSD training schedule:
125
 
126
+ - **Base Model**: LLaMA-style AR model with Canon layers
127
+ - **Total Training Tokens**: 1B tokens (AR) + 100M tokens (diffusion conversion)
128
+ - **WSD Warmup Phase**: 20M tokens
129
+ - **WSD Stable Phase**: 80M tokens
130
+ - **Training Objective**: Masked Diffusion Modeling (MDM)
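+
+ For intuition, the masked diffusion objective works roughly as sketched below. This is a simplified, standalone version of the masking and loss weighting implemented in `modeling_dhara.py` (`add_noise_to_tokens` and `compute_diffusion_loss`), not the exact training loop:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def mdm_corrupt(input_ids, t, mask_token_id, eps=0.001):
+     """Mask each token with probability p = (1 - eps) * t + eps, where t in [0, 1] is the noise level."""
+     batch_size, seq_len = input_ids.shape
+     p_mask = ((1 - eps) * t + eps).unsqueeze(-1).expand(batch_size, seq_len)
+     corruption_mask = torch.rand(batch_size, seq_len, device=input_ids.device) < p_mask
+     noisy_ids = torch.where(corruption_mask, torch.full_like(input_ids, mask_token_id), input_ids)
+     return noisy_ids, corruption_mask, p_mask
+
+ def mdm_loss(logits, labels, corruption_mask, p_mask):
+     """Cross-entropy on masked positions only, importance-weighted by 1 / p_mask."""
+     vocab_size = logits.size(-1)
+     per_token = F.cross_entropy(
+         logits.view(-1, vocab_size), labels.view(-1), reduction="none"
+     ).view_as(p_mask)
+     weighted = per_token[corruption_mask] / p_mask[corruption_mask]
+     return weighted.sum() / labels.numel()  # normalize over all positions
+ ```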
131
 
132
+ ### Canon Layers
133
 
134
+ Canon layers are causal depthwise convolutions that provide local context mixing with O(n) complexity, following "Physics of Language Models: Part 4.1" by Zeyuan Allen-Zhu. In Dhara they are configured as follows (a minimal sketch follows the list):
 
 
 
135
 
136
+ - **Position A**: Applied after input LayerNorm, before attention
137
+ - **Position C**: Applied after post-attention LayerNorm, before MLP
138
+ - **Kernel Size**: 4 tokens
139
+ - **Residual Connection**: Enabled
140
+ - **Activation**: None (as recommended for transformers)
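+
+ A minimal, standalone sketch of this operation (mirroring the `CanonLayer` implementation in `modeling_dhara.py`; the class name here is illustrative):
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class CausalDepthwiseConv(nn.Module):
+     """Kernel-4 depthwise Conv1d with left padding, so position i only mixes positions i-3..i."""
+     def __init__(self, hidden_size: int, kernel_size: int = 4):
+         super().__init__()
+         self.conv = nn.Conv1d(
+             hidden_size, hidden_size,
+             kernel_size=kernel_size,
+             padding=kernel_size - 1,  # symmetric pad; the right side is trimmed below to stay causal
+             groups=hidden_size,       # depthwise: one filter per channel
+             bias=False,
+         )
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         # x: [batch, seq_len, hidden] -> Conv1d expects [batch, hidden, seq_len]
+         seq_len = x.shape[1]
+         out = self.conv(x.transpose(1, 2))[:, :, :seq_len]  # drop right padding
+         return x + out.transpose(1, 2)  # residual connection, no activation
+ ```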
141
 
142
  ## Limitations
143
 
144
+ - This is a research model and may generate inaccurate or inappropriate content
145
+ - Performance may vary on tasks requiring long-range dependencies
146
+ - The model was trained on a limited dataset and may have knowledge gaps
 
147
 
148
  ## Citation
149
 
150
+ If you use this model, please cite:
151
+
152
  ```bibtex
153
+ @misc{dhara2024,
154
+ title={Dhara: Diffusion Language Model with Canon Layers},
155
+ author={CodeLion},
156
+ year={2024},
157
+ publisher={HuggingFace},
158
+ url={https://huggingface.co/codelion/dhara-70m}
159
  }
160
  ```
161
 
162
+ ## License
 
 
 
 
 
 
163
 
164
+ Apache 2.0
config.json CHANGED
@@ -2,12 +2,12 @@
2
  "architectures": [
3
  "DharaForMaskedDiffusion"
4
  ],
 
5
  "auto_map": {
6
  "AutoConfig": "modeling_dhara.DharaConfig",
7
  "AutoModel": "modeling_dhara.DharaForMaskedDiffusion",
8
  "AutoModelForCausalLM": "modeling_dhara.DharaForMaskedDiffusion"
9
  },
10
- "attention_dropout": 0.0,
11
  "bos_token_id": 1,
12
  "canon_activation": false,
13
  "canon_bias": false,
@@ -15,26 +15,26 @@
15
  "canon_residual": true,
16
  "canon_set": "AC",
17
  "eos_token_id": 2,
18
- "head_dim": 64,
19
  "hidden_act": "silu",
20
  "hidden_size": 384,
21
  "initializer_range": 0.02,
22
  "intermediate_size": 1024,
23
  "mask_epsilon": 0.001,
24
  "mask_token_id": 50256,
25
- "max_position_embeddings": 2048,
26
  "model_type": "dhara",
27
- "num_attention_heads": 6,
28
  "num_diffusion_steps": 1000,
29
  "num_hidden_layers": 32,
30
- "num_key_value_heads": 6,
31
  "pad_token_id": 0,
32
- "rms_norm_eps": 1e-05,
33
  "rope_theta": 10000.0,
34
- "torch_dtype": "float32",
35
  "transformers_version": "4.55.2",
36
  "use_cache": false,
37
  "use_flash_attention": false,
38
  "use_xformers": false,
39
- "vocab_size": 50257
40
  }
 
2
  "architectures": [
3
  "DharaForMaskedDiffusion"
4
  ],
5
+ "attention_dropout": 0.0,
6
  "auto_map": {
7
  "AutoConfig": "modeling_dhara.DharaConfig",
8
  "AutoModel": "modeling_dhara.DharaForMaskedDiffusion",
9
  "AutoModelForCausalLM": "modeling_dhara.DharaForMaskedDiffusion"
10
  },
 
11
  "bos_token_id": 1,
12
  "canon_activation": false,
13
  "canon_bias": false,
 
15
  "canon_residual": true,
16
  "canon_set": "AC",
17
  "eos_token_id": 2,
18
+ "head_dim": 48,
19
  "hidden_act": "silu",
20
  "hidden_size": 384,
21
  "initializer_range": 0.02,
22
  "intermediate_size": 1024,
23
  "mask_epsilon": 0.001,
24
  "mask_token_id": 50256,
25
+ "max_position_embeddings": 1024,
26
  "model_type": "dhara",
27
+ "num_attention_heads": 8,
28
  "num_diffusion_steps": 1000,
29
  "num_hidden_layers": 32,
30
+ "num_key_value_heads": 4,
31
  "pad_token_id": 0,
32
+ "rms_norm_eps": 1e-06,
33
  "rope_theta": 10000.0,
34
+ "torch_dtype": "bfloat16",
35
  "transformers_version": "4.55.2",
36
  "use_cache": false,
37
  "use_flash_attention": false,
38
  "use_xformers": false,
39
+ "vocab_size": 50304
40
  }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:138820db3e8e59ed037924f14f9739ca9667e406465fc236fa9765691386f5fc
3
- size 304219496
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e67749b3a03df19a3a36cbf3994997df649b5c2044cbf581a22e5eb8f473975f
3
+ size 142728304
modeling_dhara.py CHANGED
@@ -1,24 +1,29 @@
1
  #!/usr/bin/env python3
2
  """
3
- Dhara: Diffusion Language Model
4
 
5
- A diffusion-based language model that combines:
6
- 1. Masked diffusion training (MDM) with bidirectional attention
7
- 2. Canon layers for local context mixing via causal depthwise convolutions
8
- 3. High-throughput parallel token generation
9
 
10
- Usage:
11
- from transformers import AutoModel, AutoTokenizer
12
- model = AutoModel.from_pretrained("codelion/dhara-70m", trust_remote_code=True)
13
- tokenizer = AutoTokenizer.from_pretrained("codelion/dhara-70m")
 
 
 
 
14
  """
15
 
16
  import math
17
- from typing import Optional, Tuple, Union
 
18
 
19
  import torch
20
  import torch.nn as nn
21
  import torch.nn.functional as F
 
22
 
23
  from transformers import PreTrainedModel
24
  from transformers.generation import GenerationMixin
@@ -36,12 +41,18 @@ try:
36
  except ImportError:
37
  FLASH_ATTN_AVAILABLE = False
38
 
 
 
 
 
 
 
39
 
40
  class DharaConfig(PretrainedConfig):
41
  """
42
  Configuration for Dhara model.
43
 
44
- Dhara is a diffusion language model with Canon layers for local context mixing.
45
  """
46
 
47
  model_type = "dhara"
@@ -49,33 +60,33 @@ class DharaConfig(PretrainedConfig):
49
  def __init__(
50
  self,
51
  # Core architecture
52
- vocab_size: int = 50257,
53
  hidden_size: int = 384,
54
  num_hidden_layers: int = 32,
55
- num_attention_heads: int = 6,
56
- num_key_value_heads: int = 6,
57
  intermediate_size: int = 1024,
58
  head_dim: int = None,
59
  max_position_embeddings: int = 2048,
60
 
61
  # Model specifics
62
  hidden_act: str = "silu",
63
- rms_norm_eps: float = 1e-5,
64
  rope_theta: float = 10000.0,
65
  initializer_range: float = 0.02,
66
  tie_word_embeddings: bool = True,
67
  attention_dropout: float = 0.0,
68
 
69
  # Canon layer parameters
70
- canon_set: str = "AC",
71
- canon_kernel: int = 4,
72
- canon_residual: bool = True,
73
- canon_activation: bool = False,
74
  canon_bias: bool = False,
75
 
76
  # Diffusion specific
77
- mask_token_id: int = 50256,
78
- mask_epsilon: float = 0.001,
79
  num_diffusion_steps: int = 1000,
80
 
81
  # Special tokens
@@ -85,7 +96,7 @@ class DharaConfig(PretrainedConfig):
85
 
86
  # Performance flags
87
  use_cache: bool = False,
88
- use_flash_attention: bool = False,
89
  use_xformers: bool = False,
90
 
91
  **kwargs
@@ -98,6 +109,7 @@ class DharaConfig(PretrainedConfig):
98
  **kwargs
99
  )
100
 
 
101
  self.vocab_size = vocab_size
102
  self.hidden_size = hidden_size
103
  self.num_hidden_layers = num_hidden_layers
@@ -107,29 +119,44 @@ class DharaConfig(PretrainedConfig):
107
  self.head_dim = head_dim or (hidden_size // num_attention_heads)
108
  self.max_position_embeddings = max_position_embeddings
109
 
 
110
  self.hidden_act = hidden_act
111
  self.rms_norm_eps = rms_norm_eps
112
  self.rope_theta = rope_theta
113
  self.initializer_range = initializer_range
 
114
  self.attention_dropout = attention_dropout
115
 
 
116
  self.canon_set = canon_set
117
  self.canon_kernel = canon_kernel
118
  self.canon_residual = canon_residual
119
  self.canon_activation = canon_activation
120
  self.canon_bias = canon_bias
121
 
122
- self.mask_token_id = mask_token_id
 
123
  self.mask_epsilon = mask_epsilon
124
  self.num_diffusion_steps = num_diffusion_steps
125
 
 
 
 
 
 
 
126
  self.use_cache = use_cache
127
  self.use_flash_attention = use_flash_attention
128
  self.use_xformers = use_xformers
129
 
130
 
131
  class CanonLayer(nn.Module):
132
- """Causal 1D depthwise convolution for local context mixing."""
 
 
 
 
 
133
 
134
  def __init__(
135
  self,
@@ -145,29 +172,49 @@ class CanonLayer(nn.Module):
145
  self.use_residual = use_residual
146
  self.use_activation = use_activation
147
 
 
148
  self.conv = nn.Conv1d(
149
  in_channels=hidden_size,
150
  out_channels=hidden_size,
151
  kernel_size=kernel_size,
152
- padding=kernel_size - 1,
153
- groups=hidden_size,
154
  bias=use_bias,
155
  )
156
 
 
157
  nn.init.normal_(self.conv.weight, mean=0.0, std=0.02)
158
  if use_bias:
159
  nn.init.zeros_(self.conv.bias)
160
 
161
  def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
 
 
 
 
 
 
162
  batch_size, seq_len, hidden_size = hidden_states.shape
 
 
163
  x = hidden_states.transpose(1, 2)
 
 
164
  out = self.conv(x)
 
165
  out = out[:, :, :seq_len]
 
 
166
  if self.use_activation:
167
  out = F.silu(out)
 
 
168
  out = out.transpose(1, 2)
 
 
169
  if self.use_residual:
170
  out = hidden_states + out
 
171
  return out
172
 
173
 
@@ -206,6 +253,7 @@ class RotaryEmbedding(nn.Module):
206
  def _set_cos_sin_cache(self, seq_len, device, dtype):
207
  self.max_seq_len_cached = seq_len
208
  t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
 
209
  freqs = torch.outer(t, self.inv_freq)
210
  emb = torch.cat((freqs, freqs), dim=-1)
211
  self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
@@ -214,6 +262,7 @@ class RotaryEmbedding(nn.Module):
214
  def forward(self, x, seq_len=None):
215
  if seq_len > self.max_seq_len_cached:
216
  self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
 
217
  return (
218
  self.cos_cached[:seq_len].to(dtype=x.dtype),
219
  self.sin_cached[:seq_len].to(dtype=x.dtype),
@@ -221,14 +270,17 @@ class RotaryEmbedding(nn.Module):
221
 
222
 
223
  def rotate_half(x):
 
224
  x1 = x[..., : x.shape[-1] // 2]
225
  x2 = x[..., x.shape[-1] // 2 :]
226
  return torch.cat((-x2, x1), dim=-1)
227
 
228
 
229
  def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
 
230
  cos = cos[position_ids].unsqueeze(unsqueeze_dim)
231
  sin = sin[position_ids].unsqueeze(unsqueeze_dim)
 
232
  cos = cos.to(q.dtype)
233
  sin = sin.to(q.dtype)
234
  q_embed = (q * cos) + (rotate_half(q) * sin)
@@ -241,12 +293,14 @@ class DharaMLP(nn.Module):
241
 
242
  def __init__(self, config):
243
  super().__init__()
 
244
  self.hidden_size = config.hidden_size
245
  self.intermediate_size = config.intermediate_size
246
 
247
  self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
248
  self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
249
  self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
 
250
  self.act_fn = nn.SiLU()
251
 
252
  def forward(self, x):
@@ -254,6 +308,7 @@ class DharaMLP(nn.Module):
254
 
255
 
256
  def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
 
257
  batch, num_key_value_heads, slen, head_dim = hidden_states.shape
258
  if n_rep == 1:
259
  return hidden_states
@@ -262,7 +317,7 @@ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
262
 
263
 
264
  class DharaAttention(nn.Module):
265
- """Multi-Head Bidirectional Attention with GQA support"""
266
 
267
  def __init__(self, config: DharaConfig, layer_idx: Optional[int] = None):
268
  super().__init__()
@@ -277,7 +332,13 @@ class DharaAttention(nn.Module):
277
  self.num_key_value_groups = self.num_heads // self.num_key_value_heads
278
  self.max_position_embeddings = config.max_position_embeddings
279
  self.rope_theta = config.rope_theta
280
- self.is_causal = False # Bidirectional for diffusion
 
 
 
 
 
 
281
 
282
  self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
283
  self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
@@ -311,6 +372,12 @@ class DharaAttention(nn.Module):
311
 
312
  kv_seq_len = key_states.shape[-2]
313
  if past_key_value is not None:
 
 
 
 
 
 
314
  kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
315
 
316
  cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
@@ -323,6 +390,7 @@ class DharaAttention(nn.Module):
323
  key_states = repeat_kv(key_states, self.num_key_value_groups)
324
  value_states = repeat_kv(value_states, self.num_key_value_groups)
325
 
 
326
  if FLASH_ATTN_AVAILABLE and self.config.use_flash_attention:
327
  query_states = query_states.transpose(1, 2).contiguous()
328
  key_states = key_states.transpose(1, 2).contiguous()
@@ -334,17 +402,26 @@ class DharaAttention(nn.Module):
334
  value_states = value_states.to(torch.bfloat16)
335
 
336
  attn_output = flash_attn_func(
337
- query_states, key_states, value_states,
338
- dropout_p=0.0, causal=False,
 
 
 
339
  )
 
340
  attn_output = attn_output.view(bsz, q_len, self.hidden_size)
 
341
  else:
 
342
  attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
 
343
  if attention_mask is not None:
344
  attn_weights = attn_weights + attention_mask
 
345
  attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
346
  attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
347
  attn_output = torch.matmul(attn_weights, value_states)
 
348
  attn_output = attn_output.transpose(1, 2).contiguous()
349
  attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
350
 
@@ -357,15 +434,23 @@ class DharaAttention(nn.Module):
357
 
358
 
359
  class DharaDecoderLayer(nn.Module):
360
- """Dhara decoder layer with Canon layers"""
 
 
 
 
 
 
361
 
362
  def __init__(self, config: DharaConfig, layer_idx: int):
363
  super().__init__()
364
  self.hidden_size = config.hidden_size
365
  self.config = config
366
 
 
367
  self.input_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
368
 
 
369
  self.canon_a = None
370
  if "A" in config.canon_set:
371
  self.canon_a = CanonLayer(
@@ -376,9 +461,13 @@ class DharaDecoderLayer(nn.Module):
376
  use_bias=config.canon_bias,
377
  )
378
 
 
379
  self.self_attn = DharaAttention(config=config, layer_idx=layer_idx)
 
 
380
  self.post_attention_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
381
 
 
382
  self.canon_c = None
383
  if "C" in config.canon_set:
384
  self.canon_c = CanonLayer(
@@ -389,6 +478,7 @@ class DharaDecoderLayer(nn.Module):
389
  use_bias=config.canon_bias,
390
  )
391
 
 
392
  self.mlp = DharaMLP(config)
393
 
394
  def forward(
@@ -401,11 +491,15 @@ class DharaDecoderLayer(nn.Module):
401
  use_cache: Optional[bool] = False,
402
  ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
403
  residual = hidden_states
 
 
404
  hidden_states = self.input_layernorm(hidden_states)
405
 
 
406
  if self.canon_a is not None:
407
  hidden_states = self.canon_a(hidden_states)
408
 
 
409
  hidden_states, self_attn_weights, present_key_value = self.self_attn(
410
  hidden_states=hidden_states,
411
  attention_mask=attention_mask,
@@ -416,9 +510,11 @@ class DharaDecoderLayer(nn.Module):
416
  )
417
  hidden_states = residual + hidden_states
418
 
 
419
  residual = hidden_states
420
  hidden_states = self.post_attention_layernorm(hidden_states)
421
 
 
422
  if self.canon_c is not None:
423
  hidden_states = self.canon_c(hidden_states)
424
 
@@ -426,8 +522,10 @@ class DharaDecoderLayer(nn.Module):
426
  hidden_states = residual + hidden_states
427
 
428
  outputs = (hidden_states,)
 
429
  if output_attentions:
430
  outputs += (self_attn_weights,)
 
431
  if use_cache:
432
  outputs += (present_key_value,)
433
 
@@ -456,7 +554,9 @@ class DharaPreTrainedModel(PreTrainedModel):
456
 
457
 
458
  class DharaModel(DharaPreTrainedModel):
459
- """Dhara base model with bidirectional attention and Canon layers."""
 
 
460
 
461
  def __init__(self, config: DharaConfig):
462
  super().__init__(config)
@@ -467,6 +567,7 @@ class DharaModel(DharaPreTrainedModel):
467
  self.layers = nn.ModuleList(
468
  [DharaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
469
  )
 
470
  self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
471
  self.gradient_checkpointing = False
472
 
@@ -495,12 +596,14 @@ class DharaModel(DharaPreTrainedModel):
495
  return_dict: Optional[bool] = None,
496
  ) -> Union[Tuple, BaseModelOutputWithPast]:
497
  output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
498
- output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
 
 
499
  use_cache = use_cache if use_cache is not None else self.config.use_cache
500
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
501
 
502
  if input_ids is not None and inputs_embeds is not None:
503
- raise ValueError("You cannot specify both input_ids and inputs_embeds")
504
  elif input_ids is not None:
505
  batch_size, seq_length = input_ids.shape[:2]
506
  elif inputs_embeds is not None:
@@ -508,8 +611,12 @@ class DharaModel(DharaPreTrainedModel):
508
  else:
509
  raise ValueError("You have to specify either input_ids or inputs_embeds")
510
 
511
- if self.gradient_checkpointing and self.training and use_cache:
512
- use_cache = False
 
 
 
 
513
 
514
  past_key_values_length = 0
515
  if use_cache:
@@ -531,8 +638,10 @@ class DharaModel(DharaPreTrainedModel):
531
  if self._use_flash_attention_2:
532
  attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
533
  else:
 
534
  if attention_mask is not None:
535
  if attention_mask.dim() == 2:
 
536
  attention_mask_4d = attention_mask[:, None, None, :].expand(
537
  batch_size, 1, seq_length, seq_length
538
  ).to(dtype=inputs_embeds.dtype)
@@ -541,8 +650,13 @@ class DharaModel(DharaPreTrainedModel):
541
  torch.tensor(float('-inf'), dtype=inputs_embeds.dtype, device=attention_mask_4d.device),
542
  torch.tensor(0.0, dtype=inputs_embeds.dtype, device=attention_mask_4d.device)
543
  )
 
 
 
 
544
 
545
  hidden_states = inputs_embeds
 
546
  all_hidden_states = () if output_hidden_states else None
547
  all_self_attns = () if output_attentions else None
548
  next_decoder_cache = None
@@ -554,8 +668,12 @@ class DharaModel(DharaPreTrainedModel):
554
  if self.gradient_checkpointing and self.training:
555
  layer_outputs = self._gradient_checkpointing_func(
556
  decoder_layer.__call__,
557
- hidden_states, attention_mask, position_ids,
558
- past_key_values, output_attentions, use_cache,
 
 
 
 
559
  )
560
  else:
561
  layer_outputs = decoder_layer(
@@ -571,6 +689,7 @@ class DharaModel(DharaPreTrainedModel):
571
 
572
  if use_cache:
573
  next_decoder_cache = layer_outputs[2 if output_attentions else 1]
 
574
  if output_attentions:
575
  all_self_attns += (layer_outputs[1],)
576
 
@@ -594,23 +713,36 @@ class DharaModel(DharaPreTrainedModel):
594
  )
595
 
596
  def add_noise_to_tokens(self, input_ids: torch.LongTensor, t: torch.FloatTensor, eps: float = None):
597
- """MDM-style masking: Replace tokens with [MASK] based on noise level t."""
 
 
 
 
 
 
 
 
 
 
598
  batch_size, seq_len = input_ids.shape
599
  device = input_ids.device
600
 
601
  if eps is None:
602
  eps = getattr(self.config, 'mask_epsilon', 0.001)
603
  p_mask = (1 - eps) * t + eps
 
604
  p_mask = p_mask.unsqueeze(-1).expand(batch_size, seq_len)
605
 
606
  corruption_mask = torch.rand(batch_size, seq_len, device=device) < p_mask
607
- noisy_input_ids = torch.where(corruption_mask, self.mask_token_id, input_ids)
 
 
608
 
609
  return noisy_input_ids, corruption_mask, p_mask
610
 
611
 
612
  class DharaForMaskedDiffusion(DharaPreTrainedModel, GenerationMixin):
613
- """Dhara Model with Masked Diffusion head for training and inference"""
614
  _tied_weights_keys = ["lm_head.weight"]
615
 
616
  def __init__(self, config):
@@ -636,6 +768,9 @@ class DharaForMaskedDiffusion(DharaPreTrainedModel, GenerationMixin):
636
  def set_output_embeddings(self, new_embeddings):
637
  self.lm_head = new_embeddings
638
 
 
 
 
639
  def get_decoder(self):
640
  return self.model
641
 
@@ -655,7 +790,9 @@ class DharaForMaskedDiffusion(DharaPreTrainedModel, GenerationMixin):
655
  p_mask: Optional[torch.Tensor] = None,
656
  ) -> Union[Tuple, MaskedLMOutput]:
657
  output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
658
- output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
 
 
659
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
660
 
661
  outputs = self.model(
@@ -693,9 +830,13 @@ class DharaForMaskedDiffusion(DharaPreTrainedModel, GenerationMixin):
693
  )
694
 
695
  def compute_diffusion_loss(self, logits, labels, corruption_mask=None, p_mask=None):
696
- """MDM loss with p_mask importance weighting."""
 
 
697
  if corruption_mask is None or p_mask is None:
698
- raise ValueError("MDM requires both corruption_mask and p_mask for loss computation.")
 
 
699
 
700
  loss = F.cross_entropy(
701
  logits.view(-1, self.config.vocab_size),
@@ -706,6 +847,7 @@ class DharaForMaskedDiffusion(DharaPreTrainedModel, GenerationMixin):
706
 
707
  masked_losses = loss[corruption_mask]
708
  masked_p_mask = p_mask[corruption_mask]
 
709
  weighted_losses = masked_losses / masked_p_mask
710
 
711
  total_positions = labels.shape[0] * labels.shape[1]
@@ -728,11 +870,15 @@ class DharaForMaskedDiffusion(DharaPreTrainedModel, GenerationMixin):
728
  max_cache_length = None
729
 
730
  if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
731
- input_ids = input_ids[:, -(attention_mask.shape[1] - past_length):]
732
  elif past_length < input_ids.shape[1]:
733
  input_ids = input_ids[:, past_length:]
734
 
735
- if max_cache_length is not None and attention_mask is not None and cache_length + input_ids.shape[1] > max_cache_length:
 
 
 
 
736
  attention_mask = attention_mask[:, -max_cache_length:]
737
 
738
  position_ids = kwargs.get("position_ids", None)
@@ -740,21 +886,32 @@ class DharaForMaskedDiffusion(DharaPreTrainedModel, GenerationMixin):
740
  position_ids = attention_mask.long().cumsum(-1) - 1
741
  position_ids.masked_fill_(attention_mask == 0, 1)
742
  if past_key_values:
743
- position_ids = position_ids[:, -input_ids.shape[1]:]
744
 
745
  if inputs_embeds is not None and past_key_values is None:
746
  model_inputs = {"inputs_embeds": inputs_embeds}
747
  else:
748
  model_inputs = {"input_ids": input_ids}
749
 
750
- model_inputs.update({
751
- "position_ids": position_ids,
752
- "past_key_values": past_key_values,
753
- "use_cache": kwargs.get("use_cache"),
754
- "attention_mask": attention_mask,
755
- })
 
 
756
  return model_inputs
757
 
 
 
 
 
 
 
 
 
 
758
  @torch.no_grad()
759
  def generate(
760
  self,
@@ -764,27 +921,32 @@ class DharaForMaskedDiffusion(DharaPreTrainedModel, GenerationMixin):
764
  num_diffusion_steps: int = 10,
765
  temperature: float = 1.0,
766
  top_p: float = 0.9,
 
767
  do_sample: bool = True,
768
  pad_token_id: Optional[int] = None,
769
  eos_token_id: Optional[int] = None,
 
770
  **kwargs
771
  ) -> torch.LongTensor:
772
  """
773
- Generate text using masked diffusion sampling.
774
 
775
- This method performs iterative denoising: starting from fully masked tokens,
776
- it progressively unmasks positions based on model confidence.
 
777
 
778
  Args:
779
  input_ids: Input prompt token IDs [batch_size, prompt_len]
780
  max_length: Maximum total sequence length (prompt + generation)
781
  max_new_tokens: Number of new tokens to generate (alternative to max_length)
782
- num_diffusion_steps: Number of denoising iterations (more = higher quality, slower)
783
  temperature: Sampling temperature (higher = more random)
784
  top_p: Nucleus sampling threshold
 
785
  do_sample: Whether to sample or take argmax
786
  pad_token_id: Token ID for padding
787
  eos_token_id: Token ID for end of sequence
 
788
 
789
  Returns:
790
  Generated token IDs including the prompt
@@ -816,97 +978,134 @@ class DharaForMaskedDiffusion(DharaPreTrainedModel, GenerationMixin):
816
  if eos_token_id is None:
817
  eos_token_id = self.config.eos_token_id if hasattr(self.config, 'eos_token_id') else 2
818
 
819
- # Initialize: prompt + masked tokens for generation
820
- total_len = prompt_len + gen_len
821
- tokens = torch.full((batch_size, total_len), mask_token_id, dtype=torch.long, device=device)
822
- tokens[:, :prompt_len] = input_ids
823
-
824
- # Track which positions are masked (need generation)
825
- is_masked = torch.ones(batch_size, total_len, dtype=torch.bool, device=device)
826
- is_masked[:, :prompt_len] = False # Prompt is not masked
827
-
828
- # Number of tokens to unmask per step
829
- tokens_per_step = max(1, gen_len // num_diffusion_steps)
830
 
831
- # Iterative denoising
832
- for step in range(num_diffusion_steps):
833
- # Forward pass to get logits
834
- outputs = self(input_ids=tokens)
 
 
 
 
 
 
 
 
 
 
 
 
835
  logits = outputs.logits # [batch, seq_len, vocab]
836
 
837
- # Only consider masked positions
838
- masked_positions = is_masked.clone()
839
 
840
- if not masked_positions.any():
841
- break # All tokens have been generated
 
 
 
 
842
 
843
  # Apply temperature
844
- if temperature != 1.0:
845
- logits = logits / temperature
846
-
847
- # Get probabilities
848
- probs = F.softmax(logits, dim=-1)
849
-
850
- # Calculate confidence (max prob) for each position
851
- confidence, _ = probs.max(dim=-1) # [batch, seq_len]
852
-
853
- # Mask out already-generated positions from confidence calculation
854
- confidence = confidence.masked_fill(~masked_positions, -float('inf'))
855
-
856
- # Determine how many tokens to unmask this step
857
- remaining_masked = masked_positions.sum(dim=1) # [batch]
858
-
859
- # For the last step, unmask everything remaining
860
- if step == num_diffusion_steps - 1:
861
- num_to_unmask = remaining_masked
 
 
 
 
 
 
 
 
 
862
  else:
863
- num_to_unmask = torch.minimum(
864
- torch.tensor(tokens_per_step, device=device).expand(batch_size),
865
- remaining_masked
866
- )
867
-
868
- # For each batch item, unmask the highest confidence positions
869
- for b in range(batch_size):
870
- if num_to_unmask[b] == 0:
871
- continue
872
-
873
- # Get confidence scores for this batch item
874
- conf_b = confidence[b] # [seq_len]
875
-
876
- # Get top-k positions with highest confidence
877
- k = int(num_to_unmask[b].item())
878
- _, top_indices = conf_b.topk(k)
879
-
880
- # Sample or argmax for these positions
881
- for idx in top_indices:
882
- pos_logits = logits[b, idx] # [vocab]
883
-
884
- if do_sample and temperature > 0:
885
- # Top-p (nucleus) sampling
886
- sorted_logits, sorted_indices = torch.sort(pos_logits, descending=True)
887
- sorted_probs = F.softmax(sorted_logits, dim=-1)
888
- cumsum_probs = torch.cumsum(sorted_probs, dim=-1)
889
-
890
- # Remove tokens with cumulative probability above top_p
891
- sorted_indices_to_remove = cumsum_probs > top_p
892
- sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
893
- sorted_indices_to_remove[0] = False
894
 
895
- sorted_logits[sorted_indices_to_remove] = float('-inf')
896
- probs_filtered = F.softmax(sorted_logits, dim=-1)
897
 
898
- # Sample
899
- sampled_idx = torch.multinomial(probs_filtered, 1)
900
- token_id = sorted_indices[sampled_idx]
901
- else:
902
- # Greedy (argmax)
903
- token_id = pos_logits.argmax()
904
 
905
- tokens[b, idx] = token_id
906
- is_masked[b, idx] = False
 
907
 
908
- return tokens
909
 
910
  def save_pretrained(self, save_directory, **kwargs):
 
911
  kwargs['safe_serialization'] = kwargs.get('safe_serialization', True)
912
  return super().save_pretrained(save_directory, **kwargs)
 
 
1
  #!/usr/bin/env python3
2
  """
3
+ Dhara: Diffusion LLM with Canon Layers
4
 
5
+ Combines:
6
+ 1. Dhara's masked diffusion training (bidirectional attention, high throughput)
7
+ 2. Canon layers (local context mixing via causal depthwise convolutions)
 
8
 
9
+ Canon layers from "Physics of Language Models: Part 4.1" by Zeyuan Allen-Zhu:
10
+ - Position A: After input LayerNorm, before attention
11
+ - Position C: After post-attention LayerNorm, before MLP
12
+ - kernel_size=4, residual=True, activation=False (default)
13
+
14
+ Expected benefits:
15
+ - ~280-290 tok/s throughput (Dhara's parallel generation)
16
+ - +0.25-0.5% accuracy improvement (Canon's local context mixing)
17
  """
18
 
19
  import math
20
+ import warnings
21
+ from typing import Optional, Tuple, Union, List
22
 
23
  import torch
24
  import torch.nn as nn
25
  import torch.nn.functional as F
26
+ from torch.nn import CrossEntropyLoss
27
 
28
  from transformers import PreTrainedModel
29
  from transformers.generation import GenerationMixin
 
41
  except ImportError:
42
  FLASH_ATTN_AVAILABLE = False
43
 
44
+ try:
45
+ import xformers.ops as xops
46
+ XFORMERS_AVAILABLE = True
47
+ except ImportError:
48
+ XFORMERS_AVAILABLE = False
49
+
50
 
51
  class DharaConfig(PretrainedConfig):
52
  """
53
  Configuration for Dhara model.
54
 
55
+ Combines Dhara diffusion config with Canon layer parameters.
56
  """
57
 
58
  model_type = "dhara"
 
60
  def __init__(
61
  self,
62
  # Core architecture
63
+ vocab_size: int = 50304,
64
  hidden_size: int = 384,
65
  num_hidden_layers: int = 32,
66
+ num_attention_heads: int = 8,
67
+ num_key_value_heads: int = 4,
68
  intermediate_size: int = 1024,
69
  head_dim: int = None,
70
  max_position_embeddings: int = 2048,
71
 
72
  # Model specifics
73
  hidden_act: str = "silu",
74
+ rms_norm_eps: float = 1e-6,
75
  rope_theta: float = 10000.0,
76
  initializer_range: float = 0.02,
77
  tie_word_embeddings: bool = True,
78
  attention_dropout: float = 0.0,
79
 
80
  # Canon layer parameters
81
+ canon_set: str = "AC", # Positions: A (before attn), C (before MLP)
82
+ canon_kernel: int = 4, # Kernel size (2-4)
83
+ canon_residual: bool = True, # Highly recommended
84
+ canon_activation: bool = False, # NOT recommended for transformers
85
  canon_bias: bool = False,
86
 
87
  # Diffusion specific
88
+ mask_token_id: int = None, # Will be set from tokenizer
89
+ mask_epsilon: float = 0.001, # Minimum mask probability
90
  num_diffusion_steps: int = 1000,
91
 
92
  # Special tokens
 
96
 
97
  # Performance flags
98
  use_cache: bool = False,
99
+ use_flash_attention: bool = True,
100
  use_xformers: bool = False,
101
 
102
  **kwargs
 
109
  **kwargs
110
  )
111
 
112
+ # Core architecture
113
  self.vocab_size = vocab_size
114
  self.hidden_size = hidden_size
115
  self.num_hidden_layers = num_hidden_layers
 
119
  self.head_dim = head_dim or (hidden_size // num_attention_heads)
120
  self.max_position_embeddings = max_position_embeddings
121
 
122
+ # Model specifics
123
  self.hidden_act = hidden_act
124
  self.rms_norm_eps = rms_norm_eps
125
  self.rope_theta = rope_theta
126
  self.initializer_range = initializer_range
127
+ self.tie_word_embeddings = tie_word_embeddings
128
  self.attention_dropout = attention_dropout
129
 
130
+ # Canon parameters
131
  self.canon_set = canon_set
132
  self.canon_kernel = canon_kernel
133
  self.canon_residual = canon_residual
134
  self.canon_activation = canon_activation
135
  self.canon_bias = canon_bias
136
 
137
+ # Diffusion specific
138
+ self.mask_token_id = mask_token_id if mask_token_id is not None else (vocab_size - 1)
139
  self.mask_epsilon = mask_epsilon
140
  self.num_diffusion_steps = num_diffusion_steps
141
 
142
+ # Special tokens
143
+ self.bos_token_id = bos_token_id
144
+ self.eos_token_id = eos_token_id
145
+ self.pad_token_id = pad_token_id
146
+
147
+ # Performance
148
  self.use_cache = use_cache
149
  self.use_flash_attention = use_flash_attention
150
  self.use_xformers = use_xformers
151
 
152
 
153
  class CanonLayer(nn.Module):
154
+ """
155
+ Canon Layer: Causal 1D depthwise convolution for local context mixing.
156
+
157
+ From "Physics of Language Models: Part 4.1" by Zeyuan Allen-Zhu.
158
+ Captures local sequential dependencies with O(n) complexity.
159
+ """
160
 
161
  def __init__(
162
  self,
 
172
  self.use_residual = use_residual
173
  self.use_activation = use_activation
174
 
175
+ # Depthwise causal convolution
176
  self.conv = nn.Conv1d(
177
  in_channels=hidden_size,
178
  out_channels=hidden_size,
179
  kernel_size=kernel_size,
180
+ padding=kernel_size - 1, # Causal (left-pad)
181
+ groups=hidden_size, # Depthwise
182
  bias=use_bias,
183
  )
184
 
185
+ # Initialize for stability
186
  nn.init.normal_(self.conv.weight, mean=0.0, std=0.02)
187
  if use_bias:
188
  nn.init.zeros_(self.conv.bias)
189
 
190
  def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
191
+ """
192
+ Args:
193
+ hidden_states: [batch_size, seq_len, hidden_size]
194
+ Returns:
195
+ output: [batch_size, seq_len, hidden_size]
196
+ """
197
  batch_size, seq_len, hidden_size = hidden_states.shape
198
+
199
+ # Transpose for Conv1d: [B, H, L]
200
  x = hidden_states.transpose(1, 2)
201
+
202
+ # Apply conv with causal padding
203
  out = self.conv(x)
204
+ # Remove right padding to make it causal
205
  out = out[:, :, :seq_len]
206
+
207
+ # Optional activation
208
  if self.use_activation:
209
  out = F.silu(out)
210
+
211
+ # Transpose back: [B, L, H]
212
  out = out.transpose(1, 2)
213
+
214
+ # Residual connection
215
  if self.use_residual:
216
  out = hidden_states + out
217
+
218
  return out
219
 
220
 
 
253
  def _set_cos_sin_cache(self, seq_len, device, dtype):
254
  self.max_seq_len_cached = seq_len
255
  t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
256
+
257
  freqs = torch.outer(t, self.inv_freq)
258
  emb = torch.cat((freqs, freqs), dim=-1)
259
  self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
 
262
  def forward(self, x, seq_len=None):
263
  if seq_len > self.max_seq_len_cached:
264
  self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
265
+
266
  return (
267
  self.cos_cached[:seq_len].to(dtype=x.dtype),
268
  self.sin_cached[:seq_len].to(dtype=x.dtype),
 
270
 
271
 
272
  def rotate_half(x):
273
+ """Rotates half the hidden dims of the input."""
274
  x1 = x[..., : x.shape[-1] // 2]
275
  x2 = x[..., x.shape[-1] // 2 :]
276
  return torch.cat((-x2, x1), dim=-1)
277
 
278
 
279
  def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
280
+ """Applies Rotary Position Embedding to query and key tensors."""
281
  cos = cos[position_ids].unsqueeze(unsqueeze_dim)
282
  sin = sin[position_ids].unsqueeze(unsqueeze_dim)
283
+ # Cast to input dtype for consistency
284
  cos = cos.to(q.dtype)
285
  sin = sin.to(q.dtype)
286
  q_embed = (q * cos) + (rotate_half(q) * sin)
 
293
 
294
  def __init__(self, config):
295
  super().__init__()
296
+ self.config = config
297
  self.hidden_size = config.hidden_size
298
  self.intermediate_size = config.intermediate_size
299
 
300
  self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
301
  self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
302
  self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
303
+
304
  self.act_fn = nn.SiLU()
305
 
306
  def forward(self, x):
 
308
 
309
 
310
  def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
311
+ """Repeat KV heads for GQA."""
312
  batch, num_key_value_heads, slen, head_dim = hidden_states.shape
313
  if n_rep == 1:
314
  return hidden_states
 
317
 
318
 
319
  class DharaAttention(nn.Module):
320
+ """Multi-Head Bidirectional Attention with GQA support (for diffusion)"""
321
 
322
  def __init__(self, config: DharaConfig, layer_idx: Optional[int] = None):
323
  super().__init__()
 
332
  self.num_key_value_groups = self.num_heads // self.num_key_value_heads
333
  self.max_position_embeddings = config.max_position_embeddings
334
  self.rope_theta = config.rope_theta
335
+ self.is_causal = False # CRITICAL: Dhara uses bidirectional attention
336
+
337
+ if (self.head_dim * self.num_heads) != self.hidden_size:
338
+ raise ValueError(
339
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
340
+ f" and `num_heads`: {self.num_heads})."
341
+ )
342
 
343
  self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
344
  self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
 
372
 
373
  kv_seq_len = key_states.shape[-2]
374
  if past_key_value is not None:
375
+ if self.layer_idx is None:
376
+ raise ValueError(
377
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
378
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
379
+ "with a layer index."
380
+ )
381
  kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
382
 
383
  cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
 
390
  key_states = repeat_kv(key_states, self.num_key_value_groups)
391
  value_states = repeat_kv(value_states, self.num_key_value_groups)
392
 
393
+ # Flash Attention for bidirectional
394
  if FLASH_ATTN_AVAILABLE and self.config.use_flash_attention:
395
  query_states = query_states.transpose(1, 2).contiguous()
396
  key_states = key_states.transpose(1, 2).contiguous()
 
402
  value_states = value_states.to(torch.bfloat16)
403
 
404
  attn_output = flash_attn_func(
405
+ query_states,
406
+ key_states,
407
+ value_states,
408
+ dropout_p=0.0,
409
+ causal=False, # Bidirectional for diffusion
410
  )
411
+
412
  attn_output = attn_output.view(bsz, q_len, self.hidden_size)
413
+
414
  else:
415
+ # Standard attention
416
  attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
417
+
418
  if attention_mask is not None:
419
  attn_weights = attn_weights + attention_mask
420
+
421
  attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
422
  attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
423
  attn_output = torch.matmul(attn_weights, value_states)
424
+
425
  attn_output = attn_output.transpose(1, 2).contiguous()
426
  attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
427
 
 
434
 
435
 
436
  class DharaDecoderLayer(nn.Module):
437
+ """
438
+ Dhara decoder layer with Canon layers at positions A and C.
439
+
440
+ Flow:
441
+ x -> LayerNorm -> [CanonA] -> Attention -> + residual
442
+ x -> LayerNorm -> [CanonC] -> MLP -> + residual
443
+ """
444
 
445
  def __init__(self, config: DharaConfig, layer_idx: int):
446
  super().__init__()
447
  self.hidden_size = config.hidden_size
448
  self.config = config
449
 
450
+ # Pre-attention norm
451
  self.input_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
452
 
453
+ # Canon-A: before attention
454
  self.canon_a = None
455
  if "A" in config.canon_set:
456
  self.canon_a = CanonLayer(
 
461
  use_bias=config.canon_bias,
462
  )
463
 
464
+ # Attention
465
  self.self_attn = DharaAttention(config=config, layer_idx=layer_idx)
466
+
467
+ # Post-attention norm
468
  self.post_attention_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
469
 
470
+ # Canon-C: before MLP
471
  self.canon_c = None
472
  if "C" in config.canon_set:
473
  self.canon_c = CanonLayer(
 
478
  use_bias=config.canon_bias,
479
  )
480
 
481
+ # MLP
482
  self.mlp = DharaMLP(config)
483
 
484
  def forward(
 
491
  use_cache: Optional[bool] = False,
492
  ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
493
  residual = hidden_states
494
+
495
+ # Pre-attention layernorm
496
  hidden_states = self.input_layernorm(hidden_states)
497
 
498
+ # Canon-A (before attention)
499
  if self.canon_a is not None:
500
  hidden_states = self.canon_a(hidden_states)
501
 
502
+ # Self Attention (bidirectional)
503
  hidden_states, self_attn_weights, present_key_value = self.self_attn(
504
  hidden_states=hidden_states,
505
  attention_mask=attention_mask,
 
510
  )
511
  hidden_states = residual + hidden_states
512
 
513
+ # MLP block
514
  residual = hidden_states
515
  hidden_states = self.post_attention_layernorm(hidden_states)
516
 
517
+ # Canon-C (before MLP)
518
  if self.canon_c is not None:
519
  hidden_states = self.canon_c(hidden_states)
520
 
 
522
  hidden_states = residual + hidden_states
523
 
524
  outputs = (hidden_states,)
525
+
526
  if output_attentions:
527
  outputs += (self_attn_weights,)
528
+
529
  if use_cache:
530
  outputs += (present_key_value,)
531
 
 
554
 
555
 
556
  class DharaModel(DharaPreTrainedModel):
557
+ """
558
+ Dhara base model with bidirectional attention and Canon layers.
559
+ """
560
 
561
  def __init__(self, config: DharaConfig):
562
  super().__init__(config)
 
567
  self.layers = nn.ModuleList(
568
  [DharaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
569
  )
570
+
571
  self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
572
  self.gradient_checkpointing = False
573
 
 
596
  return_dict: Optional[bool] = None,
597
  ) -> Union[Tuple, BaseModelOutputWithPast]:
598
  output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
599
+ output_hidden_states = (
600
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
601
+ )
602
  use_cache = use_cache if use_cache is not None else self.config.use_cache
603
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
604
 
605
  if input_ids is not None and inputs_embeds is not None:
606
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
607
  elif input_ids is not None:
608
  batch_size, seq_length = input_ids.shape[:2]
609
  elif inputs_embeds is not None:
 
611
  else:
612
  raise ValueError("You have to specify either input_ids or inputs_embeds")
613
 
614
+ if self.gradient_checkpointing and self.training:
615
+ if use_cache:
616
+ logger.warning_once(
617
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
618
+ )
619
+ use_cache = False
620
 
621
  past_key_values_length = 0
622
  if use_cache:
 
638
  if self._use_flash_attention_2:
639
  attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
640
  else:
641
+ # Bidirectional attention mask (not causal)
642
  if attention_mask is not None:
643
  if attention_mask.dim() == 2:
644
+ batch_size, seq_length = attention_mask.shape
645
  attention_mask_4d = attention_mask[:, None, None, :].expand(
646
  batch_size, 1, seq_length, seq_length
647
  ).to(dtype=inputs_embeds.dtype)
 
650
  torch.tensor(float('-inf'), dtype=inputs_embeds.dtype, device=attention_mask_4d.device),
651
  torch.tensor(0.0, dtype=inputs_embeds.dtype, device=attention_mask_4d.device)
652
  )
653
+ else:
654
+ attention_mask = attention_mask
655
+ else:
656
+ attention_mask = None
657
 
658
  hidden_states = inputs_embeds
659
+
660
  all_hidden_states = () if output_hidden_states else None
661
  all_self_attns = () if output_attentions else None
662
  next_decoder_cache = None
 
668
  if self.gradient_checkpointing and self.training:
669
  layer_outputs = self._gradient_checkpointing_func(
670
  decoder_layer.__call__,
671
+ hidden_states,
672
+ attention_mask,
673
+ position_ids,
674
+ past_key_values,
675
+ output_attentions,
676
+ use_cache,
677
  )
678
  else:
679
  layer_outputs = decoder_layer(
 

            if use_cache:
                next_decoder_cache = layer_outputs[2 if output_attentions else 1]
+
            if output_attentions:
                all_self_attns += (layer_outputs[1],)
 
        )

    def add_noise_to_tokens(self, input_ids: torch.LongTensor, t: torch.FloatTensor, eps: float = None):
+        """
+        MDM-style masking: Replace tokens with [MASK] based on noise level t.
+
+        Args:
+            input_ids: Input token IDs [batch_size, seq_len]
+            t: Noise level in [0, 1] [batch_size]
+            eps: Minimum mask probability
+
+        Returns:
+            Tuple of (noisy_input_ids, corruption_mask, p_mask)
+        """
        batch_size, seq_len = input_ids.shape
        device = input_ids.device

        if eps is None:
            eps = getattr(self.config, 'mask_epsilon', 0.001)
        p_mask = (1 - eps) * t + eps
+
        p_mask = p_mask.unsqueeze(-1).expand(batch_size, seq_len)

        corruption_mask = torch.rand(batch_size, seq_len, device=device) < p_mask
+
+        mask_token_id = self.mask_token_id
+        noisy_input_ids = torch.where(corruption_mask, mask_token_id, input_ids)

        return noisy_input_ids, corruption_mask, p_mask
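For intuition, the schedule above interpolates the per-token masking probability linearly between eps and 1, so t directly controls how much of the sequence is corrupted. A minimal standalone sketch with made-up noise levels:

import torch

t = torch.tensor([0.25, 0.75])   # per-example noise levels
eps = 0.001
p_mask = (1 - eps) * t + eps     # ~[0.251, 0.750]
# Roughly 25% of the first sequence and 75% of the second are replaced by [MASK];
# corruption_mask records exactly which positions were corrupted.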
 

class DharaForMaskedDiffusion(DharaPreTrainedModel, GenerationMixin):
+    """Dhara Model with Masked Diffusion head for training"""
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config):
 
    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

+    def set_decoder(self, decoder):
+        self.model = decoder
+
    def get_decoder(self):
        return self.model


        p_mask: Optional[torch.Tensor] = None,
    ) -> Union[Tuple, MaskedLMOutput]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.model(
 
        )

    def compute_diffusion_loss(self, logits, labels, corruption_mask=None, p_mask=None):
+        """
+        MDM loss with p_mask importance weighting.
+        """
        if corruption_mask is None or p_mask is None:
+            raise ValueError(
+                "MDM requires both corruption_mask and p_mask for loss computation."
+            )

        loss = F.cross_entropy(
            logits.view(-1, self.config.vocab_size),

        masked_losses = loss[corruption_mask]
        masked_p_mask = p_mask[corruption_mask]
+
+        weighted_losses = masked_losses / masked_p_mask

        total_positions = labels.shape[0] * labels.shape[1]
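The weighting above is the usual masked-diffusion estimator: each masked position's cross-entropy is divided by its masking probability, then normalized over all positions. A toy numeric sketch, assuming the method finishes by summing the weighted terms and dividing by total_positions (that final line is outside this hunk):

import torch

ce = torch.tensor([2.0, 1.0])    # cross-entropy at two masked positions
p = torch.tensor([0.50, 0.25])   # their masking probabilities
weighted = ce / p                # [4.0, 4.0]
loss = weighted.sum() / 8        # with batch_size * seq_len = 2 * 4, gives 1.0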
 
                max_cache_length = None

            if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
+                input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
            elif past_length < input_ids.shape[1]:
                input_ids = input_ids[:, past_length:]

+            if (
+                max_cache_length is not None
+                and attention_mask is not None
+                and cache_length + input_ids.shape[1] > max_cache_length
+            ):
                attention_mask = attention_mask[:, -max_cache_length:]

        position_ids = kwargs.get("position_ids", None)
 
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
+                position_ids = position_ids[:, -input_ids.shape[1] :]

        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

+        model_inputs.update(
+            {
+                "position_ids": position_ids,
+                "past_key_values": past_key_values,
+                "use_cache": kwargs.get("use_cache"),
+                "attention_mask": attention_mask,
+            }
+        )
        return model_inputs

+    @staticmethod
+    def _reorder_cache(past_key_values, beam_idx):
+        reordered_past = ()
+        for layer_past in past_key_values:
+            reordered_past += (
+                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
+            )
+        return reordered_past
+
    @torch.no_grad()
    def generate(
        self,
 
        num_diffusion_steps: int = 10,
        temperature: float = 1.0,
        top_p: float = 0.9,
+        top_k: int = 50,
        do_sample: bool = True,
        pad_token_id: Optional[int] = None,
        eos_token_id: Optional[int] = None,
+        repetition_penalty: float = 1.2,
        **kwargs
    ) -> torch.LongTensor:
        """
+        Generate text using autoregressive sampling with the diffusion model.

+        Since this model was converted from AR to diffusion via WSD training,
+        we generate tokens one at a time left-to-right, using the model's
+        next-token predictions at each position.

        Args:
            input_ids: Input prompt token IDs [batch_size, prompt_len]
            max_length: Maximum total sequence length (prompt + generation)
            max_new_tokens: Number of new tokens to generate (alternative to max_length)
+            num_diffusion_steps: Number of refinement iterations per token (higher = better quality)
            temperature: Sampling temperature (higher = more random)
            top_p: Nucleus sampling threshold
+            top_k: Top-k sampling threshold
            do_sample: Whether to sample or take argmax
            pad_token_id: Token ID for padding
            eos_token_id: Token ID for end of sequence
+            repetition_penalty: Penalty for repeating tokens (>1 = less repetition)

        Returns:
            Generated token IDs including the prompt
 
        if eos_token_id is None:
            eos_token_id = self.config.eos_token_id if hasattr(self.config, 'eos_token_id') else 2

+        # Start with the prompt
+        generated = input_ids.clone()

+        # Track generated tokens for repetition penalty
+        generated_set = set()
+        for i in range(prompt_len):
+            for b in range(batch_size):
+                generated_set.add(input_ids[b, i].item())
+
+        # Generate tokens one at a time (autoregressive style)
+        for pos in range(gen_len):
+            # Add a mask token at the next position
+            current_seq = torch.cat([
+                generated,
+                torch.full((batch_size, 1), mask_token_id, dtype=torch.long, device=device)
+            ], dim=1)
+
+            # Get model predictions
+            outputs = self(input_ids=current_seq)
            logits = outputs.logits  # [batch, seq_len, vocab]

+            # Get logits for the last (masked) position
+            next_token_logits = logits[:, -1, :]  # [batch, vocab]

+            # Apply repetition penalty
+            if repetition_penalty != 1.0:
+                for b in range(batch_size):
+                    for prev_token in generated_set:
+                        if prev_token < next_token_logits.shape[1]:
+                            next_token_logits[b, prev_token] /= repetition_penalty

            # Apply temperature
+            if temperature != 1.0 and temperature > 0:
+                next_token_logits = next_token_logits / temperature
+
+            if do_sample and temperature > 0:
+                # Apply top-k filtering
+                if top_k > 0:
+                    indices_to_remove = next_token_logits < torch.topk(next_token_logits, top_k)[0][..., -1, None]
+                    next_token_logits[indices_to_remove] = float('-inf')
+
+                # Apply top-p (nucleus) filtering
+                if top_p < 1.0:
+                    sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
+                    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+
+                    # Remove tokens with cumulative probability above threshold
+                    sorted_indices_to_remove = cumulative_probs > top_p
+                    # Shift the indices to the right to keep the first token above threshold
+                    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+                    sorted_indices_to_remove[..., 0] = False
+
+                    # Scatter sorted indices to original indexing
+                    indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
+                    next_token_logits[indices_to_remove] = float('-inf')
+
+                # Sample from the filtered distribution
+                probs = F.softmax(next_token_logits, dim=-1)
+                next_tokens = torch.multinomial(probs, num_samples=1).squeeze(-1)
            else:
+                # Greedy decoding
+                next_tokens = next_token_logits.argmax(dim=-1)

+            # Add to generated sequence
+            generated = torch.cat([generated, next_tokens.unsqueeze(-1)], dim=1)

+            # Update generated set for repetition penalty
+            for b in range(batch_size):
+                generated_set.add(next_tokens[b].item())

+            # Check for EOS
+            if eos_token_id is not None and (next_tokens == eos_token_id).all():
+                break

+        return generated
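To make the nucleus step concrete, here is a small standalone walk-through with made-up logits; it mirrors the filtering in the sampling branch above (the logits are already sorted descending, so the sort/scatter steps are omitted):

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])
probs = F.softmax(logits, dim=-1)            # ~[0.61, 0.22, 0.14, 0.03]
cum = torch.cumsum(probs, dim=-1)            # ~[0.61, 0.83, 0.97, 1.00]
remove = cum > 0.9                           # [False, False, True, True]
remove[..., 1:] = remove[..., :-1].clone()   # shift right: keep the first token past 0.9
remove[..., 0] = False                       # -> [False, False, False, True]; only the 3% tail token is dropped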

    def save_pretrained(self, save_directory, **kwargs):
+        """Override to save in SafeTensors format by default"""
        kwargs['safe_serialization'] = kwargs.get('safe_serialization', True)
        return super().save_pretrained(save_directory, **kwargs)
+
+
+def count_parameters(model):
+    """Count total and Canon-specific parameters."""
+    total = sum(p.numel() for p in model.parameters())
+    canon = sum(p.numel() for n, p in model.named_parameters() if 'canon' in n.lower())
+    return total, canon
+
+
+if __name__ == "__main__":
+    # Quick test
+    print("Testing Dhara model creation...")
+
+    config = DharaConfig(
+        vocab_size=50304,
+        hidden_size=384,
+        num_hidden_layers=32,
+        num_attention_heads=8,
+        num_key_value_heads=4,
+        intermediate_size=1024,
+        canon_set="AC",
+        canon_kernel=4,
+        canon_residual=True,
+    )
+
+    model = DharaForMaskedDiffusion(config)
+
+    total, canon = count_parameters(model)
+    print(f"Model created successfully!")
+    print(f"Total params: {total:,} ({total/1e6:.2f}M)")
+    print(f"Canon params: {canon:,} ({100*canon/total:.3f}%)")
+    print(f"Base Dhara would be: {total - canon:,}")
+
+    # Test forward pass
+    batch_size, seq_len = 2, 64
+    input_ids = torch.randint(0, 50304, (batch_size, seq_len))
+
+    # Test with diffusion noise
+    t = torch.rand(batch_size)
+    noisy_ids, corruption_mask, p_mask = model.add_noise_to_tokens(input_ids, t)
+
+    with torch.no_grad():
+        outputs = model(
+            input_ids=noisy_ids,
+            labels=input_ids,
+            corruption_mask=corruption_mask,
+            p_mask=p_mask,
+        )
+
+    print(f"Forward pass: loss={outputs.loss.item():.4f}")
+    print("Ready for training!")
tokenizer.json CHANGED
@@ -1,21 +1,7 @@
{
  "version": "1.0",
-  "truncation": {
-    "direction": "Right",
-    "max_length": 2048,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
-  "padding": {
-    "strategy": {
-      "Fixed": 2048
-    },
-    "direction": "Right",
-    "pad_to_multiple_of": null,
-    "pad_id": 50256,
-    "pad_type_id": 0,
-    "pad_token": "<|endoftext|>"
-  },
+  "truncation": null,
+  "padding": null,
  "added_tokens": [
    {
      "id": 50256,