# Model Card for Qwen/QwQ-32B-LMUL
This model is a derivative of Qwen/QwQ-32B, modified to use a custom attention mechanism defined by the `l_mul_attention` function from the `lmul` library.
## Model Details
- Original Model: Qwen/QwQ-32B
- Architecture: qwen2
- Modification: The `forward` method of the `Qwen2Attention` module has been replaced (monkey-patched) with a custom implementation that utilizes the `l_mul_attention` logic, as sketched below.
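For illustration, here is a minimal sketch of what such a monkey-patch might look like. The `lmul` import path and the signature of `l_mul_attention` are assumptions made for this example, not the library's confirmed API; consult the `lmul` source for the actual patch.

```python
# Sketch only: the lmul import path and the l_mul_attention signature
# below are assumptions, not the library's confirmed API.
from transformers.models.qwen2 import modeling_qwen2
from lmul import l_mul_attention  # hypothetical import path

def patched_forward(self, hidden_states, attention_mask=None, **kwargs):
    # Delegate attention computation to l_mul_attention instead of the
    # standard scaled dot-product path (signature assumed).
    return l_mul_attention(self, hidden_states, attention_mask=attention_mask, **kwargs)

# Apply the patch to every Qwen2 attention layer before loading the model.
modeling_qwen2.Qwen2Attention.forward = patched_forward
```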
## Scientific Rationale
This model was modified as part of a research project investigating alternative attention mechanisms in large language models. The `l_mul_attention` function implements a novel approach to calculating attention scores, and this model serves as a test case for evaluating its performance, efficiency, and impact on reasoning and generation tasks compared to the standard attention implementation.
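For context, here is a minimal sketch of the standard scaled dot-product attention that the patched layers replace; the `l_mul_attention` internals themselves are defined in the `lmul` library and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def standard_attention(q, k, v, mask=None):
    # Baseline scaled dot-product attention: the computation that
    # l_mul_attention substitutes with its own scoring scheme.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores + mask  # additive mask, e.g. causal
    return F.softmax(scores, dim=-1) @ v

# Toy usage: batch=1, heads=4, seq_len=8, head_dim=64
q = k = v = torch.randn(1, 4, 8, 64)
out = standard_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 8, 64])
```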
By releasing this model, we hope to encourage further research into non-standard attention mechanisms and provide a practical example for the community to build upon.
## How to Get Started
You can load this model with the standard `transformers` API. Ensure you have `transformers`, `torch`, and `accelerate` installed.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Make sure to log in with your Hugging Face token if the model is private
# from huggingface_hub import login
# login("your-hf-token")
model_id = "YOUR_HF_USERNAME/QwQ-32B_LMUL" # Replace with your Hugging Face username
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
prompt = "How many r's are in the word \"strawberry\""
messages = [
{"role": "user", "content": prompt}
]
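# Format the conversation with the model's chat template.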
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
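# Generate a response for the chat-formatted prompt.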
generated_ids = model.generate(
**model_inputs,
max_new_tokens=512
)
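# Keep only the newly generated tokens, dropping the echoed prompt.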
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
## Intended Uses & Limitations
This model is intended primarily for research purposes. Its performance on standard benchmarks has not been fully evaluated. The custom attention mechanism may introduce unexpected behaviors or limitations not present in the original Qwen/QwQ-32B model.
## Licensing Information
This model is released under the Apache 2.0 license, the same license as the base model, Qwen/QwQ-32B. By using this model, you agree to the terms of the original license, and it is your responsibility to ensure compliance with all applicable licenses and regulations.