# Model Card for Qwen/QwQ-32B-LMUL
This model is a derivative of Qwen/QwQ-32B, modified to use a custom attention mechanism defined by the `l_mul_attention` function from the `lmul` library.
## Model Details
- Original Model: Qwen/QwQ-32B
- Architecture: qwen2
- Modification: The `forward` method of the `Qwen2Attention` module has been replaced (monkey-patched) with a custom implementation that utilizes the `l_mul_attention` logic, as sketched below.
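For illustration, here is a minimal sketch of what such a monkey-patch might look like. The `lmul` import path and the signature of `l_mul_attention` are assumptions made for this example, not the library's confirmed API; consult the `lmul` source for the actual patch.

```python
# Sketch only: the lmul import path and the l_mul_attention signature
# below are assumptions, not the library's confirmed API.
from transformers.models.qwen2 import modeling_qwen2
from lmul import l_mul_attention  # hypothetical import path

def patched_forward(self, hidden_states, attention_mask=None, **kwargs):
    # Delegate attention computation to l_mul_attention instead of the
    # standard scaled dot-product path (signature assumed).
    return l_mul_attention(self, hidden_states, attention_mask=attention_mask, **kwargs)

# Apply the patch to every Qwen2 attention layer before loading the model.
modeling_qwen2.Qwen2Attention.forward = patched_forward
```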
## Scientific Rationale
This model was modified as part of a research project investigating alternative attention mechanisms in large language models. The `l_mul_attention` function implements a novel approach to calculating attention scores, and this model serves as a test case for evaluating its performance, efficiency, and impact on reasoning and generation tasks compared to the standard attention implementation.
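For context, here is a minimal sketch of the standard scaled dot-product attention that the patched layers replace; the `l_mul_attention` internals themselves are defined in the `lmul` library and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def standard_attention(q, k, v, mask=None):
    # Baseline scaled dot-product attention: the computation that
    # l_mul_attention substitutes with its own scoring scheme.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores + mask  # additive mask, e.g. causal
    return F.softmax(scores, dim=-1) @ v

# Toy usage: batch=1, heads=4, seq_len=8, head_dim=64
q = k = v = torch.randn(1, 4, 8, 64)
out = standard_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 8, 64])
```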
By releasing this model, we hope to encourage further research into non-standard attention mechanisms and provide a practical example for the community to build upon.
## How to Get Started
You can load this model with the standard `transformers` API. Ensure you have `transformers`, `torch`, and `accelerate` installed.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Make sure to log in with your Hugging Face token if the model is private
# from huggingface_hub import login
# login("your-hf-token")
model_id = "YOUR_HF_USERNAME/QwQ-32B_LMUL" # Replace with your Hugging Face username
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
prompt = "How many r's are in the word \"strawberry\""
messages = [
{"role": "user", "content": prompt}
]
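# Format the conversation with the model's chat template.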
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
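# Generate a response for the chat-formatted prompt.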
generated_ids = model.generate(
**model_inputs,
max_new_tokens=512
)
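# Keep only the newly generated tokens, dropping the echoed prompt.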
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
## Intended Uses & Limitations
This model is intended primarily for research purposes. Its performance on standard benchmarks has not been fully evaluated. The custom attention mechanism may introduce unexpected behaviors or limitations not present in the original Qwen/QwQ-32B model.
## Licensing Information
This model is released under the Apache 2.0 license, the same license as the base model, Qwen/QwQ-32B. By using this model, you agree to the terms of the original license, and it is your responsibility to ensure compliance with all applicable licenses and regulations.