OpenSource-Hub

dflash

Library

z-lab/dflash

Block diffusion model for efficient speculative decoding in LLMs.

Overview

DFlash is a lightweight block diffusion model designed for speculative decoding in large language models. It enables efficient, high-quality parallel drafting across multiple inference backends, including vLLM, SGLang, and Transformers. The project provides pre-trained draft models for popular LLM families such as Gemma, Qwen, MiniMax, Kimi, and gpt-oss.

README Preview

# DFlash: Block Diffusion for Flash Speculative Decoding
[**Paper**](https://arxiv.org/abs/2602.06036) | [**Blog**](https://z-lab.ai/projects/dflash/) | [**Models**](https://huggingface.co/collections/z-lab/dflash)

**DFlash** is a lightweight **block diffusion** model designed for speculative decoding. It enables efficient and high-quality parallel drafting.

https://github.com/user-attachments/assets/5b29cabb-eb95-44c9-8ffe-367c0758de8c

## Supported Models

| Model | DFlash Draft |
|---|---|
| gemma-4-26B-A4B-it | [z-lab/gemma-4-26B-A4B-it-DFlash](https://huggingface.co/z-lab/gemma-4-26B-A4B-it-DFlash) |
| gemma-4-31B-it | [z-lab/gemma-4-31B-it-DFlash](https://huggingface.co/z-lab/gemma-4-31B-it-DFlash) |
| Qwen3.6-27B | [z-lab/Qwen3.6-27B-DFlash](https://huggingface.co/z-lab/Qwen3.6-27B-DFlash) |
| Qwen3.6-35B-A3B | [z-lab/Qwen3.6-35B-A3B-DFlash](https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash) |
| MiniMax-M2.5 (Preview) | [z-lab/MiniMax-M2.5-DFlash](https://huggingface.co/z-lab/MiniMax-M2.5-DFlash) |
| Kimi-K2.5 | [z-lab/Kimi-K2.5-DFlash](https://huggingface.co/z-lab/Kimi-K2.5-DFlash) |
| Qwen3.5-4B | [z-lab/Qwen3.5-4B-DFlash](https://huggingface.co/z-lab/Qwen3.5-4B-DFlash) |
| Qwen3.5-9B | [z-lab/Qwen3.5-9B-DFlash](https://huggingface.co/z-lab/Qwen3.5-9B-DFlash) |
| Qwen3.5-27B | [z-lab/Qwen3.5-27B-DFlash](https://huggingface.co/z-lab/Qwen3.5-27B-DFlash) |
| Qwen3.5-35B-A3B | [z-lab/Qwen3.5-35B-A3B-DFlash](https://huggingface.co/z-lab/Qwen3.5-35B-A3B-DFlash) |
| Qwen3.5-122B-A10B | [z-lab/Qwen3.5-122B-A10B-DFlash](https://huggingface.co/z-lab/Qwen3.5-122B-A10B-DFlash) |
| Qwen3-Coder-Next | [z-lab/Qwen3-Coder-Next-DFlash](https://huggingface.co/z-lab/Qwen3-Coder-Next-DFlash) |
| Qwen3-Coder-30B-A3B | [z-lab/Qwen3-Coder-30B-A3B-DFlash](https://huggingface.co/z-lab/Qwen3-Coder-30B-A3B-DFlash) |
| gpt-oss-20b | [z-lab/gpt-oss-20b-DFlash](https://huggingface.co/z-lab/gpt-oss-20b-DFlash) |
| gpt-oss-120b | [z-lab/gpt-oss-120b-DFlash

FAQ (4)

Information
What ablation study compares KV cache injection vs input fusion in DFlash, and which method performs better?

A direct ablation compares DFlash-inputfusion-5L (which feeds fused hidden features as input) with DFlash-5L (which uses KV injection). KV injection achieves a higher acceptance length (AL) and speedup on all three benchmarks: on GSM8K, 4.2 AL and 3.3x speedup vs 3.5 AL and 2.9x for input fusion; on HumanEval, 4.0 AL and 3.2x vs 3.5 AL and 2.9x; on MT-Bench, 3.0 AL and 2.2x vs 2.6 AL and 2.0x. KV injection also reduces draft prefill time, because the target context bypasses full token processing and is injected directly into the K/V cache.
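
To make the size of the gap easier to read, the figures quoted in this answer can be turned into relative gains. The sketch below is only arithmetic on those numbers; the names `RESULTS`, `relative_gain`, and `summarize` are hypothetical helpers for illustration and are not part of DFlash.

```python
# (acceptance length, speedup) pairs copied from the ablation figures above.
RESULTS = {
    "GSM8K": {"kv_injection": (4.2, 3.3), "input_fusion": (3.5, 2.9)},
    "HumanEval": {"kv_injection": (4.0, 3.2), "input_fusion": (3.5, 2.9)},
    "MT-Bench": {"kv_injection": (3.0, 2.2), "input_fusion": (2.6, 2.0)},
}


def relative_gain(new: float, old: float) -> float:
    """Return the percentage improvement of `new` over `old`."""
    return (new / old - 1.0) * 100.0


def summarize(results: dict) -> dict:
    """Map each benchmark to (AL gain %, speedup gain %), rounded to 0.1."""
    rows = {}
    for bench, variants in results.items():
        kv_al, kv_speedup = variants["kv_injection"]
        fusion_al, fusion_speedup = variants["input_fusion"]
        rows[bench] = (
            round(relative_gain(kv_al, fusion_al), 1),
            round(relative_gain(kv_speedup, fusion_speedup), 1),
        )
    return rows


if __name__ == "__main__":
    for bench, (al_gain, speedup_gain) in summarize(RESULTS).items():
        print(f"{bench}: AL +{al_gain}%, speedup +{speedup_gain}%")
```

On these figures, KV injection improves acceptance length by roughly 14 to 20 percent and end-to-end speedup by roughly 10 to 14 percent, so the speedup gain is consistently smaller than the acceptance-length gain.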

GitHub Issue #58
Implementation guidance
Can I use DFlash speculative decoding with vision-language models (VLMs) like Qwen3-VL?

Yes, DFlash can be adapted for VLMs. For SGLang, use PR #18387 (adapted from #16818). For vLLM, use PR #36847. Initial tests with Qwen3-VL-8B-Instruct and DFlash-b16 show an average acceptance step length of ~2, even without VLM-specific training. An official DFlash checkpoint for Qwen3-VL is planned once the GPT-OSS and GLM-4.7-Flash work is complete.

GitHub Issue #14
Troubleshooting
Why do I get 'CUDA error: an illegal memory access was encountered' when using vLLM with DFlash speculative decoding on a GPTQ model?

This CUDA illegal memory access error (often in cublasGemmEx) occurred in certain vLLM nightly builds around early April 2026. It has been fixed in a later nightly release. Upgrade to the latest vLLM nightly version (post-2026-04-08) to resolve the issue. If the error persists, also ensure you are using a compatible NVIDIA driver and CUDA version (e.g., CUDA 13.0+).
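
Before and after upgrading, it can help to record which vLLM build and CUDA runtime are actually installed. The following is a minimal sketch that uses only the standard library plus an optional `torch` import; `installed_version` and `environment_report` are hypothetical helper names, and the version string it prints may not encode the nightly build date, so compare it against the build you intended to install.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> dict:
    """Collect the vLLM, PyTorch, and CUDA runtime versions visible to Python."""
    report = {
        "vllm": installed_version("vllm"),
        "torch": installed_version("torch"),
        "cuda_runtime": None,
    }
    try:
        import torch  # optional: only needed to read the CUDA runtime version

        # torch.version.cuda is None on CPU-only PyTorch builds.
        report["cuda_runtime"] = torch.version.cuda
    except (ImportError, OSError):
        pass
    return report


if __name__ == "__main__":
    for name, value in environment_report().items():
        print(f"{name}: {value}")
```

Note that `cuda_runtime` here is the CUDA version PyTorch was built against, which is separate from the NVIDIA driver version reported by `nvidia-smi`.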

GitHub Issue #51
Troubleshooting
How do I fix 'CUDA error: an illegal memory access' when using DFlash speculative decoding on an A6000?

This sporadic crash, occurring at dflash_worker_v2.py:335, is a known issue when using the DFlash speculative decoding feature on Ampere GPUs (SM86, e.g., A6000) with the flashinfer backend. The maintainers believe it has been fixed in the latest commit of PR #20547. Update your SGLang installation to pull the latest changes: pip install -e git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#egg=sglang and re-run. If the issue persists, enable synchronous CUDA launches with CUDA_LAUNCH_BLOCKING=1 to identify the exact offending kernel, and ensure flashinfer is compatible with your SM architecture.
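
The two checks mentioned above (synchronous launches and SM architecture) can also be done from Python. This is a minimal sketch, assuming PyTorch is installed in the same environment; `is_sm86` and `describe_gpus` are hypothetical helper names, not part of SGLang or flashinfer.

```python
import os

# CUDA_LAUNCH_BLOCKING must be set before CUDA is initialized, so set it
# before importing torch or launching the server from this process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def is_sm86(capability) -> bool:
    """Return True for compute capability 8.6 (Ampere, e.g. A6000)."""
    return tuple(capability) == (8, 6)


def describe_gpus() -> list:
    """List (name, compute capability) for each visible CUDA device."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        (torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    for name, capability in describe_gpus():
        note = " (SM86, the architecture affected by this issue)" if is_sm86(capability) else ""
        print(f"{name}: compute capability {capability}{note}")
```

Child processes inherit the environment variable, but setting it in the shell that starts the server has the same effect. Synchronous launches slow inference noticeably, so use this only while debugging.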

GitHub Issue #38