Z-Image Turbo

Z-Image-Turbo is a 6B-parameter text-to-image diffusion model distilled from the Z-Image foundation model, producing high-fidelity images in only 8 NFEs (Number of Function Evaluations). It delivers sub-second inference latency on H800-class GPUs and fits within 16 GB of VRAM on consumer hardware, while preserving strong photorealism, bilingual (English/Chinese) text rendering, and reliable instruction following.

Apache-2.0
Text-to-Image
Safetensors
Diffusers
English
by @AIOZAI

Last updated: 7 days ago


Summary

Introduction

Z-Image is a family of efficient text-to-image generation models built around a 6B-parameter backbone. The family is currently composed of four variants:

  • Z-Image-Turbo: a distilled variant of Z-Image that matches or surpasses leading competitors in only 8 NFEs. It achieves sub-second inference latency on enterprise-grade H800 GPUs, fits within 16 GB of VRAM on consumer devices, and is optimized for photorealistic image generation, bilingual text rendering (English and Chinese), and robust instruction adherence.
  • Z-Image: the foundation model underlying Z-Image-Turbo. It targets high-quality generation, rich aesthetics, strong diversity, and controllability, making it well-suited for creative generation, fine-tuning, and downstream development. It supports a wide range of artistic styles, effective negative prompting, and high diversity across identities, poses, compositions, and layouts.
  • Z-Image-Omni-Base: the versatile foundation checkpoint that supports both generation and editing tasks. It is released to provide the most general and diverse starting point for community-driven fine-tuning and custom development.
  • Z-Image-Edit: a variant fine-tuned from Z-Image for image editing. It supports creative image-to-image generation with strong instruction-following capabilities, enabling precise edits driven by natural language prompts.
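
As a rough plausibility check on the 16 GB figure quoted for Z-Image-Turbo, the memory needed for the backbone weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch, not a measurement: it assumes exactly 6e9 parameters and ignores the text encoder, the VAE, activations, and any offloading, all of which affect real usage. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# 6B-parameter backbone, as stated in the text above (assumed to be exactly 6e9).
PARAMS = 6e9

for dtype_name, nbytes in [("float32", 4), ("bfloat16", 2)]:
    print(f"{dtype_name}: {weight_memory_gib(PARAMS, nbytes):.1f} GiB")
```

Under these assumptions, only the 2-byte format leaves the backbone weights below 16 GiB, which is consistent with the bfloat16 loading used later in this document.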

Model Zoo

| Model | Pre-Training | SFT | RL | Step | CFG | Task | Visual Quality | Diversity | Fine-Tunability | Hugging Face | ModelScope |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Z-Image-Omni-Base | | | | 50 | | Gen. / Editing | Medium | High | Easy | To be released | To be released |
| Z-Image | | | | 50 | | Gen. | High | Medium | Easy | Model, Space | Model, Space |
| Z-Image-Turbo | | | | 8 | | Gen. | Very High | Low | N/A | Model, Space | Model, Space |
| Z-Image-Edit | | | | 50 | | Editing | High | Medium | Easy | To be released | To be released |

Model Architecture

Z-Image adopts a Scalable Single-Stream DiT (S3-DiT) architecture. Text tokens, visual semantic tokens, and image VAE tokens are concatenated at the sequence level and processed as a single unified input stream, which maximizes parameter efficiency relative to dual-stream designs.
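
The sequence-level concatenation described above can be illustrated with a toy example. This is only a sketch of the general idea, not the S3-DiT implementation: the function name `build_single_stream`, the segment order, and the token counts are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
from typing import Dict, List, Tuple

Token = List[float]  # one embedding vector


def build_single_stream(
    text_tokens: List[Token],
    semantic_tokens: List[Token],
    vae_tokens: List[Token],
) -> Tuple[List[Token], Dict[str, slice]]:
    """Concatenate three token groups along the sequence axis.

    Returns the unified stream and the slice each group occupies in it,
    so a single transformer stack could process every token together.
    """
    stream: List[Token] = []
    segments: Dict[str, slice] = {}
    groups = (
        ("text", text_tokens),
        ("semantic", semantic_tokens),
        ("image_vae", vae_tokens),
    )
    for name, tokens in groups:
        start = len(stream)
        stream.extend(tokens)
        segments[name] = slice(start, len(stream))

    # A shared stream only makes sense if all tokens have the same width.
    widths = {len(token) for token in stream}
    if len(widths) > 1:
        raise ValueError("all tokens must share one embedding width")
    return stream, segments


dim = 4  # toy embedding width
text = [[0.1] * dim for _ in range(3)]
semantic = [[0.2] * dim for _ in range(2)]
vae = [[0.3] * dim for _ in range(5)]

stream, segments = build_single_stream(text, semantic, vae)
print(len(stream), segments)
```

In a dual-stream design, by contrast, text and image tokens would pass through separate weight sets for at least part of the network; here one set of weights sees the whole stream.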

Z-Image-Turbo is derived from the base Z-Image model through a few-step distillation scheme combined with reinforcement learning post-training, reducing inference to 8 function evaluations while preserving the visual quality of the 50-step base model.
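
To put the step reduction in perspective, the number of network evaluations per image can be compared directly. The sketch below rests on one common convention that this document does not state: a step with classifier-free guidance is counted as two evaluations (conditional and unconditional), while a step without it is counted as one. The helper `count_nfe` is hypothetical.

```python
def count_nfe(steps: int, use_cfg: bool) -> int:
    """Function evaluations for one image.

    Assumption: classifier-free guidance costs two forward passes per step.
    """
    return steps * (2 if use_cfg else 1)


turbo = count_nfe(8, use_cfg=False)          # Turbo runs without CFG
base_no_cfg = count_nfe(50, use_cfg=False)   # 50-step model, guidance off
base_with_cfg = count_nfe(50, use_cfg=True)  # 50-step model, guidance on

print(f"Turbo: {turbo} NFEs")
print(f"50 steps, no CFG: {base_no_cfg} NFEs ({base_no_cfg / turbo:.2f}x Turbo)")
print(f"50 steps, CFG: {base_with_cfg} NFEs ({base_with_cfg / turbo:.2f}x Turbo)")
```

These ratios count evaluations only; wall-clock speedups also depend on batching, resolution, and hardware.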

Parameters

Inputs

  • prompt (text): A natural language description of the image to be generated. It may include details about the scene, objects, style, lighting, and overall composition.
  • height (integer): The height of the generated image in pixels.
  • width (integer): The width of the generated image in pixels.
  • steps (integer): The number of inference steps used during image generation. Higher values typically improve image quality at the cost of increased generation time. For Z-Image-Turbo, the recommended value is 8.
  • guidance_scale (float): Controls how closely the generated image follows the prompt. Higher values increase prompt alignment but may reduce creativity. For Z-Image-Turbo, classifier-free guidance is disabled and this value should be set to 0.

Output

  • output_image (image, .png): The generated image in PNG format, produced from the supplied prompt and generation parameters.
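
The inputs above could be bundled and checked before a generation call, as in the following minimal sketch. The class name `GenerationRequest` and the 1024-pixel defaults are assumptions made for illustration and are not part of the platform API; the defaults for `steps` and `guidance_scale` follow the Z-Image-Turbo recommendations listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationRequest:
    """Hypothetical container for the documented inputs."""

    prompt: str
    height: int = 1024           # assumed default, not from the model card
    width: int = 1024            # assumed default, not from the model card
    steps: int = 8               # recommended value for Z-Image-Turbo
    guidance_scale: float = 0.0  # CFG is disabled for Z-Image-Turbo

    def __post_init__(self) -> None:
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        for name in ("height", "width", "steps"):
            value = getattr(self, name)
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer")
        if self.guidance_scale < 0:
            raise ValueError("guidance_scale must not be negative")


request = GenerationRequest(prompt="A lighthouse at dusk, soft fog, film grain")
print(request)
```

Validating up front keeps malformed requests from reaching the GPU, where a failure is slower and harder to diagnose.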

Usage for developers

The following sections describe how to access and run the model on our platform.

Requirements

pip install -r requirements.txt

Code based on AIOZ structure

import os
from pathlib import Path
from typing import Any, Literal, Union

import torch
from diffusers import ZImagePipeline

...


def do_ai_task(
        prompt: str,
        height: int,
        width: int,
        steps: int,
        guidance_scale: float,
        model_storage_directory: Union[str, Path],
        device: Literal["cpu", "cuda", "gpu"] = "cpu",
        *args, **kwargs) -> Any:
    """Define AI task: load model, pre-process, post-process, etc."""
    # Define AI task workflow. Below is an example.

    model_weight_path = os.path.join(model_storage_directory, "Z-Image-Turbo")

    # 1. Load the pipeline.
    # Use bfloat16 for optimal performance on supported GPUs.
    pipe = ZImagePipeline.from_pretrained(
        model_weight_path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=False,
    )
    # PyTorch has no "gpu" device string, so map it to "cuda".
    pipe.to("cuda" if device == "gpu" else device)

    ...

    image.save("output_image.png")
    output_image = open("output_image.png", "rb")  # io.BufferedReader
    return output_image
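
The generation step is elided in the listing above. One way it might look is sketched below, assuming `pipe` is the loaded pipeline and that the `steps` input maps to the `num_inference_steps` keyword that Diffusers pipelines accept. The helper name `generate_image` is hypothetical, and this is not necessarily the code used on the platform.

```python
from typing import Any


def generate_image(
    pipe: Any,
    prompt: str,
    height: int,
    width: int,
    steps: int,
    guidance_scale: float = 0.0,
) -> Any:
    """Run a loaded text-to-image pipeline once and return the first image.

    Assumption: `pipe` follows the usual Diffusers calling convention and
    returns an object whose `.images` attribute is a list of PIL images.
    """
    result = pipe(
        prompt=prompt,
        height=height,
        width=width,
        num_inference_steps=steps,      # assumed mapping from the `steps` input
        guidance_scale=guidance_scale,  # 0 for Z-Image-Turbo, per the inputs above
    )
    return result.images[0]
```

Inside `do_ai_task`, the returned image could then be saved with `image.save("output_image.png")` as in the listing above.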

Reference

This repository is based on and inspired by Tongyi-MAI's work. We sincerely appreciate their generosity in sharing the code and model weights.

License

We respect and comply with the license terms set by the original authors credited in the Reference section.

Citation

@article{team2025zimage,
  title={Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer},
  author={Z-Image Team},
  journal={arXiv preprint arXiv:2511.22699},
  year={2025}
}