QwQ-32B
QwQ-32B is a 32.5B-parameter causal reasoning model from the Qwen series, post-trained with supervised fine-tuning and reinforcement learning to think explicitly before answering. Despite its mid-range size, it delivers performance competitive with leading reasoning systems such as DeepSeek-R1 and o1-mini, particularly on hard math, coding, and multi-step problems. It supports a native 131,072-token context (with YaRN scaling for inputs beyond 8,192 tokens) and is best driven with non-greedy sampling (Temperature 0.6, TopP 0.95, TopK 20–40).
QwQ-32B

Summary
- Introduction
- Model Overview
- Usage Guidelines
- Parameters
- Usage for Developers
- Reference
- License
- Citation
Introduction
The Qwen team has released QwQ-32B, a medium-sized reasoning model within the Qwen series. Unlike conventional instruction-tuned models, QwQ is trained to perform explicit thinking and reasoning before producing a final answer, yielding substantially stronger performance on downstream tasks that involve hard problems. Despite its relatively modest size, QwQ-32B achieves results competitive with leading reasoning systems such as DeepSeek-R1 and o1-mini.

Model Overview
QwQ-32B has the following characteristics:
- Type: Causal Language Model
- Training Stage: Pretraining & Post-training (Supervised Fine-Tuning and Reinforcement Learning)
- Architecture: Transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 32.5B
- Number of Parameters (Non-Embedding): 31.0B
- Number of Layers: 64
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: 131,072 tokens natively
- For prompts exceeding 8,192 tokens, YaRN scaling must be enabled as described in the Usage Guidelines.
Note: For the best experience, please review the Usage Guidelines before deploying QwQ-32B.
Usage Guidelines
To achieve optimal performance, the original authors recommend the following settings.
Enforce Thoughtful Output
- Ensure the model begins generation with
<think>\nto prevent empty thinking content, which can degrade output quality. Whenapply_chat_templateis used withadd_generation_prompt=True, this is handled automatically; the resulting response may omit the leading<think>tag in the visible output, which is expected behavior.
Sampling Parameters
- Use
Temperature=0.6,TopP=0.95, andMinP=0instead of greedy decoding to avoid endless repetitions. - Set
TopKbetween 20 and 40 to filter out rare token occurrences while preserving output diversity. - For supported frameworks, the
presence_penaltyparameter may be set between 0 and 2 to further reduce repetition. Higher values may introduce occasional language mixing and a slight performance drop.
No Thinking Content in History
- In multi-turn conversations, the historical model output should include only the final answer and should not retain the thinking content. This behavior is already implemented in
apply_chat_template.
Standardize Output Format
When benchmarking, the original authors recommend using prompts to standardize model outputs:
- Math Problems: Include
"Please reason step by step, and put your final answer within \boxed{}."in the prompt. - Multiple-Choice Questions: Append the following instruction to standardize responses:
"Please show your choice in the \answer` field with only the choice letter, e.g., "answer": "C"."`.
Handle Long Inputs
For inputs exceeding 8,192 tokens, enable YaRN to improve the model's ability to capture long-sequence information. For supported frameworks, the following block can be added to config.json:
{
...,
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
}
}
For deployment, the original authors recommend vLLM. Refer to the official vLLM documentation for usage details. Note that vLLM currently supports only static YaRN, meaning the scaling factor remains constant regardless of input length and may impact performance on shorter texts. It is therefore advisable to add the rope_scaling configuration only when long-context processing is required.
Parameters
Input
prompt(text): A user-provided input, which can be a question, a request, or context to be addressed. The model uses the prompt to analyze, infer, and generate a corresponding response.
Output
output(text): The generated text returned by the model as its final response to the user.
Usage for Developers
The following details describe how to access and run the model on our platform.
Requirements
pip install -r requirements.txt
Code based on AIOZ structure
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
...
def do_ai_task(
prompt: Union[str, Path],
model_storage_directory: Union[str, Path],
device: Literal["cpu", "cuda", "gpu"] = "cpu",
*args, **kwargs) -> Any:
"""Define AI task: load model, pre-process, post-process, etc."""
# Define AI task workflow. Below is an example.
model_weight_path = os.path.join(model_storage_directory, "QwQ-32B")
# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_weight_path)
model = AutoModelForCausalLM.from_pretrained(
model_weight_path,
torch_dtype="auto",
device_map="auto"
)
...
output_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
return output_text
Reference
This repository is based on and inspired by Qwen's work. We sincerely appreciate their generosity in sharing their model and code.
License
We respect and comply with the terms of the original author's license cited in the Reference section.
Citation
@misc{qwq32b,
title = {QwQ-32B: Embracing the Power of Reinforcement Learning},
url = {https://qwenlm.github.io/blog/qwq-32b/},
author = {Qwen Team},
month = {March},
year = {2025}
}
@article{qwen2.5,
title={Qwen2.5 Technical Report},
author={An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mingfeng Xue and Pei Zhang and Qin Zhu and Rui Men and Runji Lin and Tianhao Li and Tianyi Tang and Tingyu Xia and Xingzhang Ren and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yu Wan and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zihan Qiu},
journal={arXiv preprint arXiv:2412.15115},
year={2024}
}