QwQ-32B

Summary

Introduction
Model Overview
Usage Guidelines
Parameters
Usage for Developers
Reference
License
Citation

Introduction

The Qwen team has released QwQ-32B, a medium-sized reasoning model within the Qwen series. Unlike conventional instruction-tuned models, QwQ is trained to perform explicit thinking and reasoning before producing a final answer, yielding substantially stronger performance on downstream tasks that involve hard problems. Despite its relatively modest size, QwQ-32B achieves results competitive with leading reasoning systems such as DeepSeek-R1 and o1-mini.

Model Overview

QwQ-32B has the following characteristics:

Type: Causal Language Model
Training Stage: Pretraining & Post-training (Supervised Fine-Tuning and Reinforcement Learning)
Architecture: Transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
Number of Parameters: 32.5B
Number of Parameters (Non-Embedding): 31.0B
Number of Layers: 64
Number of Attention Heads (GQA): 40 for Q and 8 for KV
Context Length: 131,072 tokens natively
- For prompts exceeding 8,192 tokens, YaRN scaling must be enabled as described in the Usage Guidelines.

Note: For the best experience, please review the Usage Guidelines before deploying QwQ-32B.

Usage Guidelines

To achieve optimal performance, the original authors recommend the following settings.

Enforce Thoughtful Output

Ensure the model begins generation with <think>\n to prevent empty thinking content, which can degrade output quality. When apply_chat_template is used with add_generation_prompt=True, this is handled automatically; the resulting response may omit the leading <think> tag in the visible output, which is expected behavior.

Sampling Parameters

Use Temperature=0.6, TopP=0.95, and MinP=0 instead of greedy decoding to avoid endless repetitions.
Set TopK between 20 and 40 to filter out rare token occurrences while preserving output diversity.
For supported frameworks, the presence_penalty parameter may be set between 0 and 2 to further reduce repetition. Higher values may introduce occasional language mixing and a slight performance drop.

No Thinking Content in History

In multi-turn conversations, the historical model output should include only the final answer and should not retain the thinking content. This behavior is already implemented in apply_chat_template.

Standardize Output Format

When benchmarking, the original authors recommend using prompts to standardize model outputs:

Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
Multiple-Choice Questions: Append the following instruction to standardize responses: "Please show your choice in the \answer` field with only the choice letter, e.g., "answer": "C"."`.

Handle Long Inputs

For inputs exceeding 8,192 tokens, enable YaRN to improve the model's ability to capture long-sequence information. For supported frameworks, the following block can be added to config.json:

{
    ...,
    "rope_scaling": {
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
        "type": "yarn"
    }
}

For deployment, the original authors recommend vLLM. Refer to the official vLLM documentation for usage details. Note that vLLM currently supports only static YaRN, meaning the scaling factor remains constant regardless of input length and may impact performance on shorter texts. It is therefore advisable to add the rope_scaling configuration only when long-context processing is required.

Parameters

Input

prompt (text): A user-provided input, which can be a question, a request, or context to be addressed. The model uses the prompt to analyze, infer, and generate a corresponding response.

Output

output (text): The generated text returned by the model as its final response to the user.

Usage for Developers

The following details describe how to access and run the model on our platform.

Requirements

pip install -r requirements.txt

Code based on AIOZ structure

from transformers import AutoModelForCausalLM, AutoTokenizer
import os

...

def do_ai_task(
        prompt: Union[str, Path],
        model_storage_directory: Union[str, Path],
        device: Literal["cpu", "cuda", "gpu"] = "cpu",
        *args, **kwargs) -> Any:
    """Define AI task: load model, pre-process, post-process, etc."""
    # Define AI task workflow. Below is an example.
    model_weight_path = os.path.join(model_storage_directory, "QwQ-32B")

    # Load the tokenizer and the model
    tokenizer = AutoTokenizer.from_pretrained(model_weight_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_weight_path,
        torch_dtype="auto",
        device_map="auto"
    )

    ...

    output_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return output_text

Reference

This repository is based on and inspired by Qwen's work. We sincerely appreciate their generosity in sharing their model and code.

License

We respect and comply with the terms of the original author's license cited in the Reference section.

Citation

@misc{qwq32b,
    title = {QwQ-32B: Embracing the Power of Reinforcement Learning},
    url = {https://qwenlm.github.io/blog/qwq-32b/},
    author = {Qwen Team},
    month = {March},
    year = {2025}
}

@article{qwen2.5,
    title={Qwen2.5 Technical Report},
    author={An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mingfeng Xue and Pei Zhang and Qin Zhu and Rui Men and Runji Lin and Tianhao Li and Tianyi Tang and Tingyu Xia and Xingzhang Ren and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yu Wan and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zihan Qiu},
    journal={arXiv preprint arXiv:2412.15115},
    year={2024}
}