首页
看点啥
插画图片
首页 热点时事 DiffusionGemma:开发者指引

DiffusionGemma:开发者指引

2026-06-11 0

Following our announcement in our launch blog post, we are sharing this developer guide to help you understand, serve and customize this experimental model.

Built on the Gemma 4 backbone, DiffusionGemma introduces several milestones for developer workflows:

  1. Compute-bound parallel generation: Bypasses memory-bandwidth limitations by shifting the bottleneck to compute, delivering up to 4x faster token generation on GPUs (up to 700+ tokens per second on NVIDIA GeForce RTX 5090 and 1000+ tokens per second on a single NVIDIA H100).
  2. Bidirectional context & self-correction: Uses bidirectional attention to evaluate the entire text block simultaneously during generation, enabling real-time error correction and parallel context propagation.
  3. Developer-friendly sizes: Designed as a 26B Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, allowing quantized deployment within 18 GB VRAM limits.

The Architecture

For developers building with traditional LLMs on GPUs, the primary bottleneck is memory bandwidth. Autoregressive language models must repeatedly load model weights from memory to generate text one token at a time. DiffusionGemma bypasses this limitation by shifting the bottleneck from memory bandwidth to compute, generating and refining a 256-token canvas in parallel. By providing the GPU with a large parallel workload, it utilizes tensor cores that would otherwise sit idle during local serving.

Showcase: Solving Sudoku with Parallel Denoising

Traditional autoregressive models struggle with strict, multivariable constrained problems like Sudoku. Because they generate text strictly from left to right, they cannot evaluate future placeholders or backtrack.

To demonstrate customization of DiffusionGemma, we are releasing a fine-tuning recipe and results using Hackable Diffusion, a modular JAX research toolbox. This training setup focuses on a classic multi-variable grid task: the Sudoku Solver.

Why Sudoku is Interesting for Diffusion

In an 81-character Sudoku string representation (where empty cells are marked with periods), every digit is bound by strict intersecting horizontal, vertical, and 9x9 grid constraints.

Bidirectional Context Propagation: Unlike autoregressive models, DiffusionGemma’s denoising step allows every canvas query to attend to all positions in parallel. Information flows symmetrically across the board, resolving global dependencies in each step.

Left: DiffusionGemma generating Sudoku output. The base model is unable to solve the Sudoku after 48 steps. Right: Fine-tuned (SFT) DiffusionGemma solves the puzzle after 12 steps. It is able to complete early thanks to adaptive stopping.

The Performance Impact: While the base DiffusionGemma model is not specifically trained to solve Sudoku puzzles (~0% success rate), applying the simple JAX SFT recipe on a Sudoku dataset raises correctness to 80% success, while decreasing the overall inference step count.

Block Autoregressive Denoising

To enable block autoregressive denoising, DiffusionGemma alternates between incremental prefill and denoising during inference:

This architectural choice makes the following possible:

Serving DiffusionGemma

To serve this experimental architecture efficiently, we worked with the vLLM team to implement DiffusionGemma into vLLM. This integration allows the engine to run the iterative parallel denoising loops efficiently across batched request streams.

Developers can deploy DiffusionGemma out of the box using vLLM's standard OpenAI-compatible local server.

vllm serve google/diffusiongemma-26B-A4B-it --max-model-len 262144 --max-num-seqs 4 --gpu-memory-utilization 0.85 --attention-backend TRITON_ATTN --generation-config vllm --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' --diffusion-config '{"canvas_length": 256}' --enable-chunked-prefillShell

Getting Started Today

Ready to explore the frontier of non-autoregressive text generation? Take a look at the following resources to find out more:

喜欢(0)

上一篇

DeepSeek 梁文锋当年拿高考状元照片曝光:过了清华线但报了浙大

DeepSeek 梁文锋当年拿高考状元照片曝光:过了清华线但报了浙大

下一篇

UIUC:Meta:斯坦福解读Claude Code爆火后Agent Harness底层逻辑

UIUC:Meta:斯坦福解读Claude Code爆火后Agent Harness底层逻辑
猜你喜欢