T-pro-it-2.1
Users are advised to exercise caution and are responsible for any additional training and oversight required to ensure the model's responses meet acceptable ethical and safety standards. The responsibility for incorporating this model into industrial or commercial solutions lies entirely with those who choose to deploy it.
Highlights
We introduce the updated version of the T-pro-it-2.0, named T-pro-it-2.1, featuring the following key enhancements:
Stronger instruction following: Significant gains on complex and strict instructions, outperforming T-pro-it-2.0 by +9 percentage points.
Improved general capabilities: Better comprehension and fluency in open-domain tasks, including chat and multistep content generation.
Advanced tool-calling proficiency: Robust performance in tool-calling workflows, achieving results on par with Qwen3-235B-2507.
Efficient inference: Faster response generation for Russian text via an optimized tokenizer (same as in T-pro-it-2.0).
Description
T-pro-it-2.1 is an efficient Russian-language model built on the Qwen 3 model family, with improved instruction-following and tool-calling capabilities compared to T-pro-it-2.0. It outperforms Qwen3-32B in tool-calling scenarios, which is essential for agentic applications, and is built for both general tasks and complex workflows.
More training details are available in our Habr post: https://habr.com/ru/companies/tbank/articles/979650/
NOTE: This model supports only non-thinking mode and does not generate <think></think> blocks in its output. Specifying enable_thinking=False is no longer required.
Dataset
Instruction midtraining: 40B tokens of instruction data.
Supervised Fine-Tuning (SFT): ~670K high-quality, diverse instructions of balanced complexity, combining general data with synthetic verifiable instruction-following and tool-calling scenarios.
Online RL alignment (GRPO): Synthetic data generated for instruction-following (IF) and tool-calling optimization.
- General stream: general and chat tasks;
- IF stream: diverse, verifiable synthetic tasks targeting strict instruction following (a toy example of such a check is sketched below);
- Tool-calling stream: complex workflows with multi-step tool use, yielding strong gains on tool-calling benchmarks.
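The reward functions themselves are not published. As a purely illustrative sketch, "verifiable" here means that compliance with an instruction can be checked programmatically and turned into a reward signal; the constraints and the if_reward name below are hypothetical:

# Hypothetical verifiable instruction-following check used as a GRPO-style reward;
# the actual reward functions behind T-pro-it-2.1 are not published.
def if_reward(response: str) -> float:
    """Return 1.0 only if the response obeys two checkable constraints:
    at most 50 words and exactly three '- ' bullet points."""
    within_limit = len(response.split()) <= 50
    three_bullets = sum(line.startswith("- ") for line in response.splitlines()) == 3
    return float(within_limit and three_bullets)

print(if_reward("- idea one\n- idea two\n- idea three"))  # 1.0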
Merge Strategy
In this release, we leveraged an expert merging approach. After a shared SFT stage, which includes data for core capabilities (Instruction Following, General tasks, and Tool Calling), we train three specialized experts via GRPO:
- IF Expert: Optimized for strict instruction following.
- General Expert: Focused on general and chat tasks.
- Tool-Call Expert: Trained on complex tool-calling workflows.
Each expert is trained with domain-specific data, hyperparameters, and reward functions for optimal performance. The final model is obtained by merging the three experts using SLERP (Spherical Linear Interpolation), enabling better preservation of individual capabilities compared to single-model training. To prevent artifacts after merging, we apply a polishing stage on general-domain data to slightly adjust the model weights.
This approach allows fine-grained control over each skill domain and results in a more balanced and capable unified model.
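The card does not pin the merge to a particular toolkit. As a minimal sketch, assuming two expert checkpoints are interpolated tensor by tensor with a factor t (an illustrative simplification of the actual three-expert merge), SLERP on a single weight tensor can be written as:

# Minimal SLERP sketch for merging two expert weight tensors; real merging tools
# additionally handle per-layer factors, embeddings, and full state dicts.
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float = 0.5, eps: float = 1e-7) -> torch.Tensor:
    a, b = w_a.flatten().float(), w_b.flatten().float()
    a_n, b_n = a / (a.norm() + eps), b / (b.norm() + eps)
    # Angle between the two weight vectors
    omega = torch.acos(torch.clamp(torch.dot(a_n, b_n), -1.0, 1.0))
    if omega.abs() < 1e-4:
        # Nearly parallel vectors: fall back to plain linear interpolation
        merged = (1 - t) * a + t * b
    else:
        merged = (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)
    return merged.reshape(w_a.shape).to(w_a.dtype)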
Benchmarks
| Model | Ru Arena Hard | ruIFeval* | enIFeval* | enBFCL | ruBFCL | Tau2 | ACEBench |
|---|---|---|---|---|---|---|---|
| T-pro-it-2.1 | 93.8 | 80.7 | 78.4 | 72.3 | 66.0 | 37.6 | 73.6 |
| T-pro-it-2.0 | 90.4 | 69.3 | 70.2 | 59.7 | 47.5 | 25.0 | 61.2 |
| Qwen3-32B | 87.3 | 77.4 | 77.7 | 69.2 | 57.3 | 39.3 | 65.0 |
| Devstral-Small-2-24B-Instruct-2512 | 75.7 | 71.3 | 71.3 | 63.1 | 57.0 | – | 64.3 |
| gpt-oss-20b | 73.6 | 71.1 | 67.6 | 50.0 | 37.6 | 48.7 | – |
| RuadaptQwen3-32B-Instruct | 65.4 | 70.8 | 73.5 | – | – | – | 62.2 |
Instruction following: a +9 percentage point improvement over T-pro-it-2.0. Tool calling: performance on par with Qwen3-235B-2507 on tool-calling benchmarks.
* The IFeval metric is the mean of four values: prompt-level and instruction-level accuracy, each under strict and loose criteria.
More benchmarks can be found in our Habr post.
Recommended Generation Parameters
temperature: 0.7
top_p: 0.8
top_k: 20
presence_penalty: 1.0
- Use lower temperature for straightforward queries and higher temperature for complex or creative tasks.
- A presence_penalty between 0 and 2 can help avoid repetitive outputs.
Examples of usage
SGLang Usage
For better quality and stable performance, we recommend SGLang as your inference framework.
To run an inference server for T-pro-it-2.1, start by launching the SGLang server:
python -m sglang.launch_server \
--model-path t-tech/T-pro-it-2.1 \
--tool-call-parser qwen25
vLLM Usage
vllm serve t-tech/T-pro-it-2.1 \
--enable-auto-tool-choice \
--tool-call-parser hermes
Once the server is up and listening on its host and port, you can send chat-based requests via the OpenAI Python client.
from openai import OpenAI

# OpenAI-compatible client pointed at the local server;
# adjust base_url to the host/port of your SGLang or vLLM instance.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

# Tool definition for fetching the weather
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "ΠΠΎΠ»ΡΡΠΈΡΡ ΠΊΡΠ°ΡΠΊΠΎΠ΅ ΠΎΠΏΠΈΡΠ°Π½ΠΈΠ΅ ΡΠ΅ΠΊΡΡΠ΅ΠΉ ΠΏΠΎΠ³ΠΎΠ΄Ρ Π² ΡΠΊΠ°Π·Π°Π½Π½ΠΎΠΌ Π³ΠΎΡΠΎΠ΄Π΅.",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "ΠΠΎΡΠΎΠ΄, Π½Π°ΠΏΡΠΈΠΌΠ΅Ρ 'ΠΠΎΡΠΊΠ²Π°'."
},
"date": {
"type": "string",
"description": "ΠΠ°ΡΠ° Π² ΡΠΎΡΠΌΠ°ΡΠ΅ YYYY-MM-DD (ΠΎΠΏΡΠΈΠΎΠ½Π°Π»ΡΠ½ΠΎ)."
},
},
"required": ["city"],
},
},
}
]
prompt = (
"ΠΠ½Π΅ Π½ΡΠΆΠ½ΠΎ ΡΠΏΠ»Π°Π½ΠΈΡΠΎΠ²Π°ΡΡ ΠΏΡΠΎΠ³ΡΠ»ΠΊΡ ΠΏΠΎ ΠΠΎΡΠΊΠ²Π΅ ΡΠ΅Π³ΠΎΠ΄Π½Ρ Π²Π΅ΡΠ΅ΡΠΎΠΌ. "
"ΠΡΠ»ΠΈ ΡΠ΅Π±Π΅ Π½ΡΠΆΠ½ΠΎ, ΠΎΠ±ΡΠ°ΡΠΈΡΡ ΠΊ ΠΈΠ½ΡΡΡΡΠΌΠ΅Π½ΡΡ ΠΏΠΎΠ³ΠΎΠ΄Ρ, ΡΡΠΎΠ±Ρ ΡΠ·Π½Π°ΡΡ ΡΠ΅ΠΊΡΡΠΈΠ΅ ΡΡΠ»ΠΎΠ²ΠΈΡ, "
"Π° Π·Π°ΡΠ΅ΠΌ ΠΏΡΠ΅Π΄Π»ΠΎΠΆΠΈ, ΡΡΠΎ ΠΌΠΎΠΆΠ½ΠΎ Π΄Π΅Π»Π°ΡΡ Π½Π° ΡΠ»ΠΈΡΠ΅ ΠΈ ΠΊΠ°ΠΊΠΈΠ΅ Π΅ΡΡΡ Π°Π»ΡΡΠ΅ΡΠ½Π°ΡΠΈΠ²Ρ, Π΅ΡΠ»ΠΈ Π±ΡΠ΄Π΅Ρ Π΄ΠΎΠΆΠ΄Ρ."
)
completion = client.chat.completions.create(
model="ANY", # ΡΠ΅ΡΠ²Π΅Ρ ΠΈΠ³Π½ΠΎΡΠΈΡΡΠ΅Ρ ΠΈΠΌΡ ΠΌΠΎΠ΄Π΅Π»ΠΈ
messages=[
{
"role": "system",
"content": "Π’Ρ T-pro, Π²ΠΈΡΡΡΠ°Π»ΡΠ½ΡΠΉ Π°ΡΡΠΈΡΡΠ΅Π½Ρ Π² Π’-Π’Π΅Ρ
Π½ΠΎΠ»ΠΎΠ³ΠΈΡΡ
. Π’Π²ΠΎΡ Π·Π°Π΄Π°ΡΠ° β Π±ΡΡΡ ΠΏΠΎΠ»Π΅Π·Π½ΡΠΌ Π΄ΠΈΠ°Π»ΠΎΠ³ΠΎΠ²ΡΠΌ Π°ΡΡΠΈΡΡΠ΅Π½ΡΠΎΠΌ."
},
{"role": "user", "content": prompt},
],
tools=tools,
tool_choice="auto", # ΠΌΠΎΠ΄Π΅Π»Ρ ΡΠ°ΠΌΠ° ΡΠ΅ΡΠ°Π΅Ρ, Π²ΡΠ·ΡΠ²Π°ΡΡ Π»ΠΈ ΠΈΠ½ΡΡΡΡΠΌΠ΅Π½Ρ
temperature=0.7,
top_p=0.8,
    extra_body={"top_k": 20},  # top_k is not a standard OpenAI parameter, so it is passed via extra_body
presence_penalty=1.0,
)
# In the first response the model either returns final text
# or a request to call the tool (tool_calls)
message = completion.choices[0].message
print(message)
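When the model returns tool_calls, your code executes the tool and sends the result back in a follow-up request. The continuation below is a minimal sketch with a hypothetical get_weather stub (not part of the model card):

import json

# Hypothetical stub for the declared tool; replace it with a real weather lookup.
def get_weather(city: str, date: str = "") -> str:
    return f"In {city} it is +5 °C, overcast, light rain expected in the evening."

if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)
    tool_result = get_weather(**args)
    followup = client.chat.completions.create(
        model="ANY",
        messages=[
            {"role": "system", "content": "You are T-pro, a virtual assistant at T-Technologies."},
            {"role": "user", "content": prompt},
            message.model_dump(exclude_none=True),  # assistant turn containing the tool call
            {"role": "tool", "tool_call_id": call.id, "content": tool_result},
        ],
        tools=tools,
        temperature=0.7,
        top_p=0.8,
        extra_body={"top_k": 20},
        presence_penalty=1.0,
    )
    print(followup.choices[0].message.content)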
Note: always include both temperature and presence_penalty in every completion call, as in the example above.
HF Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
torch.manual_seed(42)
model_name = "t-tech/T-pro-it-2.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto",
)
prompt = (
"ΠΠ½Π΅ Π½ΡΠΆΠ½ΠΎ ΡΠΏΠ»Π°Π½ΠΈΡΠΎΠ²Π°ΡΡ ΠΏΡΠΎΠ³ΡΠ»ΠΊΡ ΠΏΠΎ ΠΠΎΡΠΊΠ²Π΅ ΡΠ΅Π³ΠΎΠ΄Π½Ρ Π²Π΅ΡΠ΅ΡΠΎΠΌ. "
"ΠΡΠ΅Π΄Π»ΠΎΠΆΠΈ Π²Π°ΡΠΈΠ°Π½ΡΡ Π·Π°Π½ΡΡΠΈΠΉ Π½Π° ΡΠ»ΠΈΡΠ΅ ΠΈ Π² ΠΏΠΎΠΌΠ΅ΡΠ΅Π½ΠΈΠΈ, "
"ΠΏΡΠ΅Π΄ΠΏΠΎΠ»Π°Π³Π°Ρ ΡΠΈΠΏΠΈΡΠ½ΡΡ ΠΏΠΎΠ³ΠΎΠ΄Ρ Π΄Π»Ρ ΡΡΠΎΠ³ΠΎ Π²ΡΠ΅ΠΌΠ΅Π½ΠΈ Π³ΠΎΠ΄Π°."
)
messages = [
{
"role": "system",
"content": "Π’Ρ T-pro, Π²ΠΈΡΡΡΠ°Π»ΡΠ½ΡΠΉ Π°ΡΡΠΈΡΡΠ΅Π½Ρ Π² Π’-Π’Π΅Ρ
Π½ΠΎΠ»ΠΎΠ³ΠΈΡΡ
. Π’Π²ΠΎΡ Π·Π°Π΄Π°ΡΠ° β Π±ΡΡΡ ΠΏΠΎΠ»Π΅Π·Π½ΡΠΌ Π΄ΠΈΠ°Π»ΠΎΠ³ΠΎΠ²ΡΠΌ Π°ΡΡΠΈΡΡΠ΅Π½ΡΠΎΠΌ."
},
{"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=512,
)
# Drop the prompt tokens, keeping only the generated continuation
generated_ids = [
output_ids[len(input_ids):]
for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
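Recent transformers releases can also render tool schemas through the chat template. The sketch below assumes a transformers version whose apply_chat_template supports the tools argument, and reuses the tools list from the server example above:

# Sketch: tool calling through the chat template; assumes `tools` (the JSON-schema
# list from the OpenAI-client example) is defined and the template supports it.
text = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
response = tokenizer.decode(
    generated_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True
)
# A Qwen-style template typically emits the call as a <tool_call>{...}</tool_call>
# JSON block that your code has to parse and execute.
print(response)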
Long Context Usage
T-pro-it-2.1 natively supports a context length of 32,768 tokens.
For conversations where the input significantly exceeds this limit, follow the recommendations from the Qwen3 model card on processing long texts.
- Modify the model files: in the config.json file, add the rope_scaling fields:

  { ..., "rope_scaling": { "rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768 } }

  For llama.cpp, you need to regenerate the GGUF file after the modification.

- Pass command line arguments:

  For vllm, you can use:
  vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072

  For sglang, you can use:
  python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'

  For llama-server from llama.cpp, you can use:
  llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
Citation
If you find our work helpful, feel free to cite it:
@misc{stoianov2025tpro20efficientrussian,
title={T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground},
author={Dmitrii Stoianov and Danil Taranets and Olga Tsymboi and Ramil Latypov and Almaz Dautov and Vladislav Kruglikov and Nikita Surkov and German Abramov and Pavel Gein and Dmitry Abulkhanov and Mikhail Gashkov and Viktor Zelenkovskiy and Artem Batalov and Aleksandr Medvedev and Anatolii Potapov},
year={2025},
eprint={2512.10430},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.10430},
}