Question 1

What ablation study compares KV cache injection vs input fusion in DFlash, and which method performs better?

Accepted Answer

A direct ablation compares DFlash-inputfusion-5L (which feeds the fused hidden features to the draft model as input) against DFlash-5L (which uses KV injection). KV injection achieves a higher acceptance length (AL) and speedup on all three benchmarks. On GSM8K, KV injection reaches 4.2 AL and 3.3x speedup vs 3.5 AL and 2.9x speedup for input fusion; on HumanEval, 4.0 AL and 3.2x speedup vs 3.5 AL and 2.9x speedup; on MT-Bench, 3.0 AL and 2.2x speedup vs 2.6 AL and 2.0x speedup. KV injection also reduces draft prefill time, because the target context skips full token processing and is injected directly into the K/V cache.
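To put the gap in relative terms, the short sketch below computes the percentage gain of KV injection over input fusion from the figures quoted above. The dictionary layout and the helper names (`relative_gain`, `summarize`) are illustrative only and are not part of DFlash.

```python
# (acceptance length, speedup) pairs copied from the ablation figures in the answer.
results = {
    "GSM8K": {"kv_injection": (4.2, 3.3), "input_fusion": (3.5, 2.9)},
    "HumanEval": {"kv_injection": (4.0, 3.2), "input_fusion": (3.5, 2.9)},
    "MT-Bench": {"kv_injection": (3.0, 2.2), "input_fusion": (2.6, 2.0)},
}


def relative_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return (new - old) / old * 100.0


def summarize(table):
    """Return (benchmark, AL gain %, speedup gain %) for each benchmark."""
    rows = []
    for bench, entry in table.items():
        kv_al, kv_speedup = entry["kv_injection"]
        fusion_al, fusion_speedup = entry["input_fusion"]
        rows.append(
            (
                bench,
                relative_gain(kv_al, fusion_al),
                relative_gain(kv_speedup, fusion_speedup),
            )
        )
    return rows


for bench, al_gain, speedup_gain in summarize(results):
    print(f"{bench}: AL +{al_gain:.1f}%, speedup +{speedup_gain:.1f}%")
```

On these numbers, the acceptance-length gain is roughly 14% to 20% and the speedup gain roughly 10% to 14%, depending on the benchmark.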

Question 2

Can I use DFlash speculative decoding with vision-language models (VLMs) like Qwen3-VL?

Accepted Answer

Yes, DFlash can be adapted for VLMs. For SGLang, use PR #18387 (adapted from #16818); for vLLM, use PR #36847. In initial tests with Qwen3-VL-8B-Instruct and DFlash-b16, the average acceptance step length was about 2, even without VLM-specific training. An official DFlash checkpoint for Qwen3-VL is planned once the GPT-OSS and GLM-4.7-Flash work is completed.

Question 3

Why do I get 'CUDA error: an illegal memory access was encountered' when using vLLM with DFlash speculative decoding on a GPTQ model?

Accepted Answer

This CUDA illegal memory access error (often raised in cublasGemmEx) occurred in certain vLLM nightly builds from around early April 2026 and has been fixed in a later nightly release. Upgrade to the latest vLLM nightly (post-2026-04-08) to resolve it. If the error persists, also check that your NVIDIA driver and CUDA version are compatible (e.g., CUDA 13.0+).

Question 4

How to fix 'CUDA error: an illegal memory access' when using DFlash speculative decoding on A6000?

Accepted Answer

This sporadic crash at dflash_worker_v2.py:335 is a known issue when using DFlash speculative decoding on Ampere GPUs (SM86, e.g., A6000) with the flashinfer backend. The maintainers believe it has been fixed in the latest commit of PR #20547. Update your SGLang installation to pull the latest changes with `pip install -e git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#egg=sglang`, then re-run. If the issue persists, set `CUDA_LAUNCH_BLOCKING=1` to make CUDA kernel launches synchronous so the error is reported at the offending kernel, and check that your flashinfer build is compatible with your SM architecture.
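As a minimal sketch of the `CUDA_LAUNCH_BLOCKING=1` debugging step, the helper below runs an arbitrary command with that variable set in the child process environment. The function name `run_with_blocking_launches` is hypothetical, and the command in the usage section is a stand-in that only echoes the variable; replace it with whatever command you normally use to start your server or script.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(cmd):
    """Run `cmd` with CUDA_LAUNCH_BLOCKING=1 so kernel launches are synchronous.

    With synchronous launches, a CUDA error is reported at the kernel that
    caused it rather than at a later, unrelated call.
    """
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in command (an assumption for illustration): it only prints the
    # variable to confirm the child process sees it.
    check = [sys.executable, "-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"]
    result = run_with_blocking_launches(check)
    print(result.stdout.strip())
```

Synchronous launches slow execution noticeably, so this setting is meant for reproducing the crash, not for regular serving.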
