Z-Image-Turbo

Highlights

Z-Image-Turbo is a 6B-parameter text-to-image diffusion model distilled from the Z-Image foundation model, producing high-fidelity images in only 8 NFEs (Number of Function Evaluations). It delivers sub-second inference latency on H800-class GPUs and fits within 16 GB of VRAM on consumer hardware, while preserving strong photorealism, bilingual (English/Chinese) text rendering, and reliable instruction following.

Introduction

Z-Image is a family of efficient text-to-image generation models built around a 6B-parameter backbone. The family is currently composed of four variants:

Z-Image-Turbo: a distilled variant of Z-Image that matches or surpasses leading competitors while using only 8 NFEs. It achieves sub-second inference latency on enterprise-grade H800 GPUs, fits within 16 GB of VRAM on consumer devices, and is optimized for photorealistic image generation, bilingual text rendering (English and Chinese), and robust instruction adherence.
Z-Image: the foundation model underlying Z-Image-Turbo. It targets high-quality generation, rich aesthetics, strong diversity, and controllability, making it well-suited for creative generation, fine-tuning, and downstream development. It supports a wide range of artistic styles, effective negative prompting, and high diversity across identities, poses, compositions, and layouts.
Z-Image-Omni-Base: the versatile foundation checkpoint that supports both generation and editing tasks. It is released to provide the most general and diverse starting point for community-driven fine-tuning and custom development.
Z-Image-Edit: a variant fine-tuned from Z-Image for image editing. It supports creative image-to-image generation with strong instruction-following capabilities, enabling precise edits driven by natural language prompts.

Model Zoo

Model	Pre-Training	SFT	RL	Step	CFG	Task	Visual Quality	Diversity	Fine-Tunability	Hugging Face	ModelScope
Z-Image-Omni-Base	✅	❌	❌	50	✅	Gen. / Editing	Medium	High	Easy	To be released	To be released
Z-Image	✅	✅	❌	50	✅	Gen.	High	Medium	Easy
Z-Image-Turbo	✅	✅	✅	8	❌	Gen.	Very High	Low	N/A
Z-Image-Edit	✅	✅	❌	50	✅	Editing	High	Medium	Easy	To be released	To be released

Model Architecture

Z-Image adopts a Scalable Single-Stream DiT (S3-DiT) architecture. Text tokens, visual semantic tokens, and image VAE tokens are concatenated at the sequence level and processed as a single unified input stream, which maximizes parameter efficiency relative to dual-stream designs.
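
As a toy illustration of sequence-level concatenation, the sketch below joins three token lists into one stream and records where each segment begins and ends. It is not the model's implementation: the real network operates on embedding tensors, and the helper name, the segment labels, and the text / visual-semantic / VAE ordering are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Segment:
    name: str
    start: int  # inclusive index into the unified stream
    end: int    # exclusive index into the unified stream


def build_single_stream(
    text_tokens: Sequence,
    semantic_tokens: Sequence,
    vae_tokens: Sequence,
) -> Tuple[list, List[Segment]]:
    """Concatenate three token sequences into one stream (hypothetical helper)."""
    stream: list = []
    segments: List[Segment] = []
    # Assumed ordering: text, then visual semantic, then image VAE tokens.
    for name, tokens in (
        ("text", text_tokens),
        ("visual_semantic", semantic_tokens),
        ("image_vae", vae_tokens),
    ):
        start = len(stream)
        stream.extend(tokens)
        segments.append(Segment(name, start, len(stream)))
    return stream, segments


stream, segments = build_single_stream(["t0", "t1"], ["s0"], ["v0", "v1", "v2"])
```

In a single-stream design, every transformer block then processes the whole `stream` with one set of weights, rather than keeping separate per-modality branches as a dual-stream design would.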

Z-Image-Turbo is derived from the base Z-Image model through a few-step distillation scheme combined with reinforcement learning post-training, reducing inference to 8 function evaluations while preserving the visual quality of the 50-step base model.
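
To make the step reduction concrete, the following back-of-the-envelope sketch counts network forward passes for the two configurations listed in the Model Zoo table. It assumes one forward pass per sampling step and that classifier-free guidance, when enabled, adds a second (unconditional) pass per step; actual implementations may batch or schedule these passes differently.

```python
def forward_passes(steps: int, uses_cfg: bool) -> int:
    """Estimate network evaluations for a sampling run.

    Assumption: CFG runs a conditional and an unconditional pass per step.
    """
    return steps * (2 if uses_cfg else 1)


base_passes = forward_passes(steps=50, uses_cfg=True)    # Z-Image: 50 steps, CFG on
turbo_passes = forward_passes(steps=8, uses_cfg=False)   # Z-Image-Turbo: 8 steps, CFG off
reduction = base_passes / turbo_passes

print(f"base={base_passes}, turbo={turbo_passes}, reduction={reduction}x")
```

Under these assumptions the base model needs 100 evaluations against 8 for Z-Image-Turbo, a 12.5x reduction in network calls. Wall-clock speedup depends on hardware and on overheads outside the denoising loop.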

Parameters

Inputs

prompt (text): A natural language description of the image to be generated. It may include details about the scene, objects, style, lighting, and overall composition.
height (integer): The height of the generated image in pixels.
width (integer): The width of the generated image in pixels.
steps (integer): The number of inference steps used during image generation. Higher values typically improve image quality at the cost of increased generation time. For Z-Image-Turbo, the recommended value is 8.
guidance_scale (float): Controls how closely the generated image follows the prompt. Higher values increase prompt alignment but may reduce creativity. For Z-Image-Turbo, classifier-free guidance is disabled and this value should be set to 0.

Output

output_image (image, .png): The generated image in PNG format, produced from the supplied prompt and generation parameters.
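
The input constraints above can be checked before any model is loaded. The helper below is a minimal sketch, not part of the platform API: the function name and the messages are hypothetical, and it encodes only the recommendations stated in this section (8 steps and a guidance scale of 0 for Z-Image-Turbo).

```python
from typing import List


def check_turbo_inputs(
    prompt: str,
    height: int,
    width: int,
    steps: int,
    guidance_scale: float,
) -> List[str]:
    """Return a list of human-readable issues; an empty list means no issues."""
    issues: List[str] = []
    if not prompt.strip():
        issues.append("prompt is empty")
    if height <= 0:
        issues.append("height must be a positive number of pixels")
    if width <= 0:
        issues.append("width must be a positive number of pixels")
    if steps < 1:
        issues.append("steps must be at least 1")
    elif steps != 8:
        issues.append("steps differs from the recommended value of 8")
    if guidance_scale != 0:
        issues.append("guidance_scale should be 0 for Z-Image-Turbo")
    return issues


ok = check_turbo_inputs("a red bicycle", 1024, 1024, 8, 0.0)
bad = check_turbo_inputs("", 0, 1024, 8, 3.5)
```

Returning a list of issues rather than raising on the first one lets a caller decide which problems are fatal (an empty prompt) and which are only advisory (a non-default step count).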

Usage for developers

The following sections describe how to access and run the model on our platform.

Requirements

pip install -r requirements.txt

Code based on AIOZ structure

import os
from pathlib import Path
from typing import Any, Literal, Union

import torch
from diffusers import ZImagePipeline

...


def do_ai_task(
        prompt: str,
        height: int,
        width: int,
        steps: int,
        guidance_scale: float,
        model_storage_directory: Union[str, Path],
        device: Literal["cpu", "cuda", "gpu"] = "cpu",
        *args, **kwargs) -> Any:
    """Define AI task: load model, pre-process, post-process, etc."""
    # Define AI task workflow. Below is an example.

    model_weight_path = os.path.join(model_storage_directory, "Z-Image-Turbo")

    # 1. Load the pipeline.
    # Use bfloat16 for optimal performance on supported GPUs.
    pipe = ZImagePipeline.from_pretrained(
        model_weight_path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=False,
    )
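    # torch has no "gpu" device type; map the platform's "gpu" label to "cuda".
    pipe.to("cuda" if device == "gpu" else device)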

    ...

    image.save("output_image.png")
    output_image = open("output_image.png", "rb")  # io.BufferedReader
    return output_image

Reference

This repository is based on and inspired by Tongyi-MAI's work. We sincerely appreciate their generosity in sharing the code and model weights.

License

We respect and comply with the terms of the author's license cited in the Reference section.

Citation

@article{team2025zimage,
  title={Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer},
  author={Z-Image Team},
  journal={arXiv preprint arXiv:2511.22699},
  year={2025}
}