VoxCPM
Library: OpenBMB/VoxCPM
Tokenizer-free TTS for multilingual speech generation, voice design, and cloning.
Overview
VoxCPM2 is a tokenizer-free text-to-speech system built on a diffusion autoregressive architecture. It supports 30 languages, voice design from a natural-language description, controllable voice cloning, and 48kHz audio output. It uses a MiniCPM-4 backbone with 2B parameters and was trained on over 2 million hours of multilingual speech data.
README Preview
VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning

VoxCPM is a **tokenizer-free** Text-to-Speech system that directly generates continuous speech representations via an end-to-end **diffusion autoregressive architecture**, bypassing discrete tokenization to achieve highly natural and expressive synthesis.

**VoxCPM2** is the latest major release: a **2B** parameter model trained on **over 2 million hours** of multilingual speech data, now supporting **30 languages**, **Voice Design**, **Controllable Voice Cloning**, and **48kHz** studio-quality audio output. Built on a [MiniCPM-4](https://github.com/OpenBMB/MiniCPM) backbone.

### ✨ Highlights

- 🌍 **30-Language Multilingual**: Input text in any of the 30 supported languages and synthesize directly, no language tag needed
- 🎨 **Voice Design**: Create a brand-new voice from a natural-language description alone (gender, age, tone, emotion, pace …), no reference audio required
- 🎛️ **Controllable Cloning**: Clone any voice from a short reference clip, with optional style guidance to steer emotion, pace, and expression while preserving the original timbre
- 🎙️ **Ultimate Cloning**: Reproduce every vocal nuance. Provide both reference audio and its transcript, and the model continues seamlessly from the reference, faithfully preserving every vocal detail (timbre, rhythm, emotion, and style), same as VoxCPM1.5
- 🔊 **48kHz High-Quality Audio**: Accepts 16kHz reference audio and directly outputs 48kHz studio-quality audio via AudioVAE V2's asymmetric encode/decode design, with built-in super-resolution, no external upsampler needed
- 🧠 **Context-Aware Synthesis**: Automatically infers appropriate prosody and expressiveness from text conte
FAQ (5)
Troubleshooting: How to fix the 'Dimension out of range' error when running VoxCPM on CPU with PyTorch 2.11?
This is a known bug in PyTorch 2.11.0 and later that causes scaled_dot_product_attention to fail on CPU with 'Dimension out of range (expected to be in range of [-1, 0], but got -2)'. Workaround: downgrade PyTorch to a release below 2.11, such as 2.5.1. For CPU-only setups, run pip install torch==2.5.1. For GPU setups on CUDA 12.1, use the torch==2.5.1+cu121 build, which pip can fetch from the PyTorch CUDA 12.1 wheel index (pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121). See PyTorch issue #163597 for details.
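A startup check can flag an affected install before the first synthesis call fails. The sketch below is a minimal illustration, not part of VoxCPM: the helper names `parse_major_minor` and `is_affected` are hypothetical, and the 2.11 threshold is taken from the answer above rather than verified independently.

```python
import re

# First PyTorch release line reported as affected (taken from the FAQ answer).
FIRST_AFFECTED = (2, 11)


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as '2.5.1' or '2.11.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_affected(version: str, device: str = "cpu") -> bool:
    """True when the CPU attention bug described above is expected."""
    # Tuples compare numerically, so (2, 5) < (2, 11) even though "2.5" > "2.11".
    return device == "cpu" and parse_major_minor(version) >= FIRST_AFFECTED


if __name__ == "__main__":
    try:
        import torch  # imported lazily so the helpers work without PyTorch
    except ImportError:
        print("PyTorch is not installed")
    else:
        if is_affected(torch.__version__):
            print(f"torch {torch.__version__}: consider downgrading to 2.5.1")
```

Comparing integer tuples instead of raw strings matters here, because a plain string comparison would rank "2.5" above "2.11".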
Troubleshooting: Why does VoxCPM2 crash with CUDA errors (e.g., 'Offset increment outside graph capture') when using multiple subprocess workers on the same GPU?
This is a known instability caused by torch.compile's CUDA graph optimization when multiple processes share a GPU memory pool. The recommended workaround is to use a single-process serving architecture such as nano-vllm-voxcpm (https://github.com/a710128/nanovllm-voxcpm) or vllm-omni (https://github.com/OpenBMB/VoxCPM#-production-serving-vllm-omni), which avoids multi-process CUDA graph conflicts. A production-ready FastAPI wrapper for nano-vllm-voxcpm is available at https://github.com/uttera/uttera-tts-vllm.
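To illustrate the single-process idea in isolation, the sketch below funnels every request through one worker thread, so only one process (and one model instance) ever touches the GPU. It is an assumption-laden toy, not how nano-vllm-voxcpm or vllm-omni are implemented: `SingleWorkerServer` is a hypothetical class, and `synthesize` is a stand-in callable for whatever model call you actually use.

```python
import queue
import threading
from concurrent.futures import Future
from typing import Callable


class SingleWorkerServer:
    """Serialize all synthesis requests through one thread in one process."""

    def __init__(self, synthesize: Callable[[str], bytes]):
        # `synthesize` is a hypothetical stand-in for a real model call.
        self._synthesize = synthesize
        self._requests: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, text: str) -> Future:
        """Enqueue a request; the returned Future resolves to the audio."""
        future: Future = Future()
        self._requests.put((text, future))
        return future

    def close(self) -> None:
        """Stop the worker after it drains the requests queued so far."""
        self._requests.put(None)
        self._thread.join()

    def _run(self) -> None:
        while True:
            item = self._requests.get()
            if item is None:  # sentinel pushed by close()
                break
            text, future = item
            try:
                future.set_result(self._synthesize(text))
            except Exception as exc:  # hand failures back to the caller
                future.set_exception(exc)


if __name__ == "__main__":
    # Fake synthesizer that just encodes the text, for demonstration only.
    server = SingleWorkerServer(lambda text: text.encode("utf-8"))
    print(server.submit("hello").result())
    server.close()
```

Web handlers can call `submit` from any thread and wait on the returned future; requests are processed strictly one at a time, which trades throughput for the stability the answer above is after.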
Troubleshooting: Why does audio quality progressively degrade when using LoRA fine-tuning with nano-vllm on Blackwell (RTX 5090) GPUs?
This is a known issue on the Blackwell (sm_120) architecture, caused by CUDA graph memory pool conflicts with LoRA and an object leak in nano-vllm's scheduler. The only effective workaround so far is to restart the inference process every 2–3 hours, which clears the leaked objects and defragments GPU memory. Track issue #326 and nano-vllm-voxcpm #61 for permanent fixes.
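The periodic restart can be automated with a small supervisor. The following is a minimal sketch using only the Python standard library; `run_with_periodic_restart` is a hypothetical helper, and the server command you pass in (for example a `serve.py` script) is an assumption about your deployment, not something VoxCPM ships.

```python
import subprocess
import sys
import time


def run_with_periodic_restart(cmd, max_uptime_s, max_cycles=None, poll_s=1.0):
    """Run `cmd`, relaunching it whenever it exits or exceeds `max_uptime_s`.

    Returns the number of process launches performed. With max_cycles=None
    the loop runs until interrupted.
    """
    launches = 0
    while max_cycles is None or launches < max_cycles:
        proc = subprocess.Popen(cmd)
        launches += 1
        started = time.monotonic()
        try:
            # Poll until the process exits on its own or its uptime budget ends.
            while proc.poll() is None and time.monotonic() - started < max_uptime_s:
                time.sleep(poll_s)
        finally:
            if proc.poll() is None:
                proc.terminate()  # ask for a graceful shutdown first
                try:
                    proc.wait(timeout=30)
                except subprocess.TimeoutExpired:
                    proc.kill()  # force-stop a process that ignores terminate()
                    proc.wait()
    return launches


if __name__ == "__main__":
    # Example only: restart a hypothetical server script every 2 hours.
    # run_with_periodic_restart([sys.executable, "serve.py"], max_uptime_s=2 * 60 * 60)
    pass
```

A process that crashes immediately is relaunched right away, so a production setup would likely add a backoff delay and drain in-flight requests before each restart.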
Troubleshooting: Why does VoxCPM2 voice cloning produce distorted, demon-like output with incorrect audio duration?
This is a known instability in VoxCPM2 and VoxCPM1.5. As a temporary workaround, switch to VoxCPM-0.5B, which works correctly with the same inputs. No permanent fix is available yet; monitor the GitHub issue for updates.
Troubleshooting: How to fix the 'triton is not installed' warning when using torch.compile?
Install the triton version that matches your PyTorch build. For torch==2.5.1 on Linux with an NVIDIA GPU, that is triton==3.1.0, installed with pip install triton==3.1.0. Check that your hardware supports triton (compute capability 7.0 or higher). Windows support is limited; you can ignore the warning there if functionality is unaffected. If you previously installed a wrong version (for example 2.1.0, which caused errors), remove it first with pip uninstall triton, then install the correct one.
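The pairing can be checked programmatically before enabling torch.compile. This sketch reads installed versions through the standard library's `importlib.metadata`; the helper names and the `KNOWN_PAIRS` table are hypothetical, and the table contains only the single pairing stated above, so any other PyTorch version is reported as unknown rather than guessed.

```python
from importlib import metadata

# Only the pairing stated in the FAQ answer; extend it from official release notes.
KNOWN_PAIRS = {"2.5.1": "3.1.0"}


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def triton_matches(torch_version: str, triton_version):
    """True/False when the pairing is known, None when torch is not listed."""
    base = torch_version.split("+", 1)[0]  # drop local tags such as '+cu121'
    expected = KNOWN_PAIRS.get(base)
    if expected is None:
        return None
    return triton_version == expected


if __name__ == "__main__":
    torch_v, triton_v = installed_version("torch"), installed_version("triton")
    if torch_v is None:
        print("torch is not installed")
    elif triton_v is None:
        print("triton is not installed")
    else:
        print(f"torch {torch_v}, triton {triton_v}:", triton_matches(torch_v, triton_v))
```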