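Project Overview
VoxCPM2 is a tokenizer-free text-to-speech system built on a diffusion autoregressive architecture. It supports 30 languages, voice design, controllable voice cloning, and 48kHz audio output. Built on a MiniCPM backbone, the 2B-parameter model was trained on more than 2 million hours of data.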
README Preview
VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning

VoxCPM is a **tokenizer-free** Text-to-Speech system that directly generates continuous speech representations via an end-to-end **diffusion autoregressive architecture**, bypassing discrete tokenization to achieve highly natural and expressive synthesis.

**VoxCPM2** is the latest major release: a **2B** parameter model trained on **over 2 million hours** of multilingual speech data, now supporting **30 languages**, **Voice Design**, **Controllable Voice Cloning**, and **48kHz** studio-quality audio output. Built on a [MiniCPM-4](https://github.com/OpenBMB/MiniCPM) backbone.

### ✨ Highlights

- 🌍 **30-Language Multilingual**: Input text in any of the 30 supported languages and synthesize directly, no language tag needed
- 🎨 **Voice Design**: Create a brand-new voice from a natural-language description alone (gender, age, tone, emotion, pace …), no reference audio required
- 🎛️ **Controllable Cloning**: Clone any voice from a short reference clip, with optional style guidance to steer emotion, pace, and expression while preserving the original timbre
- 🎙️ **Ultimate Cloning**: Reproduce every vocal nuance: provide both reference audio and its transcript, and the model continues seamlessly from the reference, faithfully preserving every vocal detail, including timbre, rhythm, emotion, and style (same as VoxCPM1.5)
- 🔊 **48kHz High-Quality Audio**: Accepts 16kHz reference audio and directly outputs 48kHz studio-quality audio via AudioVAE V2's asymmetric encode/decode design, with built-in super-resolution, so no external upsampler is needed
- 🧠 **Context-Aware Synthesis**: Automatically infers appropriate prosody and expressiveness from text conte
FAQ (5)
Troubleshooting: How do I fix the 'Dimension out of range' error when running VoxCPM on CPU with PyTorch 2.11?
This is a known bug in PyTorch 2.11.0+ that causes scaled_dot_product_attention to fail on CPU with the error 'Dimension out of range (expected to be in range of [-1, 0], but got -2)'. Workaround: downgrade PyTorch to a version below 2.11, such as 2.5.1. For a CPU-only setup, install it with pip (pip install torch==2.5.1). For GPU (CUDA 12.1), use torch==2.5.1+cu121. See PyTorch issue #163597 for details.
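The version condition in this answer can be checked before loading the model. The following is a minimal sketch, assuming the failure is limited to CPU runs on PyTorch 2.11 and later, as stated above. Both `parse_major_minor` and `is_affected` are hypothetical helpers written for illustration; they are not part of VoxCPM or PyTorch.

```python
import re


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.5.1+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_affected(torch_version: str, device: str) -> bool:
    """Return True for the combination described above: PyTorch >= 2.11 on CPU.

    Assumption: GPU runs are treated as unaffected, since the FAQ entry
    only reports the failure on CPU.
    """
    return device == "cpu" and parse_major_minor(torch_version) >= (2, 11)
```

In practice the first argument would be `torch.__version__`. Keeping the check as a pure function on strings means it can run before PyTorch is imported, for example in an installer or startup script.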
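Troubleshooting: Why does VoxCPM2 crash with CUDA errors (for example, "Offset increment outside graph capture") when multiple subprocess workers run on the same GPU?
This is a known instability triggered by torch.compile's CUDA graph optimization when multiple processes share the GPU memory pool. The recommended workaround is a single-process serving architecture, such as nano-vllm-voxcpm (https://github.com/a710128/nanovllm-voxcpm) or vllm-omni (https://github.com/OpenBMB/VoxCPM#-production-serving-vllm-omni), which avoids multi-process CUDA graph conflicts. A production-grade FastAPI wrapper for nano-vllm-voxcpm is available at https://github.com/uttera/uttera-tts-vllm.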
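To illustrate the general idea of single-process serving, here is a minimal sketch in which one process owns the model and all requests are funneled through a single worker thread. It is not how nano-vllm-voxcpm or vllm-omni are implemented (those projects have their own schedulers), and `SingleProcessTTSService` and `fake_synthesize` are hypothetical names standing in for a real model call.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class SingleProcessTTSService:
    """Serialize synthesis requests inside one process (illustrative only)."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize
        # One worker thread: requests are queued and executed one at a time,
        # so only this process ever touches the GPU.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def submit(self, text: str) -> "Future[bytes]":
        """Queue a request and return a future holding the audio bytes."""
        return self._executor.submit(self._synthesize, text)

    def close(self) -> None:
        """Finish queued requests, then stop the worker thread."""
        self._executor.shutdown(wait=True)


def fake_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for a real model call; returns placeholder bytes."""
    return f"audio:{text}".encode("utf-8")
```

A web framework can call `submit` from many request handlers concurrently; the executor's internal queue preserves submission order and keeps the model in one process.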
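Troubleshooting: Why does audio quality gradually degrade when using nano-vllm with LoRA fine-tuning on Blackwell (RTX 5090) GPUs?
This is a known issue caused by a conflict between the CUDA graph memory pool and LoRA, together with an object leak in the nano-vllm scheduler on the Blackwell (sm_120) architecture. The only effective workaround is to restart the inference process every 2-3 hours, which clears the leaked objects and defragments GPU memory. Follow issue #326 and nano-vllm-voxcpm #61 for a permanent fix.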
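The periodic restart can be automated with a small supervisor. The sketch below is one possible approach, assuming the inference server is started from a command line and can be stopped safely with a termination signal. `run_with_periodic_restart` and `EXAMPLE_CMD` are hypothetical; the example command is only a placeholder that sleeps and must be replaced with the real server launch command.

```python
import subprocess
import sys
import time

# Placeholder worker (assumption): a Python process that just sleeps.
EXAMPLE_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def run_with_periodic_restart(cmd, max_uptime_s, max_cycles=None, poll_s=1.0):
    """Run cmd and restart it once it has been up for max_uptime_s seconds.

    Returns the number of times the process was started.
    max_cycles=None keeps restarting indefinitely.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        proc = subprocess.Popen(cmd)
        cycles += 1
        started = time.monotonic()
        # Wait until the process exits on its own or its time budget is spent.
        while proc.poll() is None and time.monotonic() - started < max_uptime_s:
            time.sleep(poll_s)
        if proc.poll() is None:
            proc.terminate()  # request a clean shutdown first
            try:
                proc.wait(timeout=30)
            except subprocess.TimeoutExpired:
                proc.kill()  # force termination if it does not exit
                proc.wait()
    return cycles
```

For the interval mentioned above, `max_uptime_s` would be set to roughly `2 * 3600` to `3 * 3600`. Requests in flight at restart time are dropped in this sketch, so a real deployment would need to drain or retry them.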
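Troubleshooting: Why does voxcpm2 voice cloning produce distorted, demonic-sounding output with an incorrect audio duration?
This is a known instability in voxcpm2 and voxcpm1.5. Temporary workaround: switch to voxcpm0.5b, which works correctly with the same input. There is no permanent fix yet; follow the GitHub issue for updates.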
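Troubleshooting: How do I fix the "triton is not installed" warning when using torch.compile?
Install the triton version that matches your PyTorch version. For torch==2.5.1, use triton==3.1.0 (on Linux with an NVIDIA GPU). Check that your hardware supports triton (compute capability 7.0 or higher). Windows support is limited; if functionality is unaffected, the warning can be ignored. Fix: pip install triton==3.1.0. If the wrong version is installed (for example, 2.1.0, which causes errors), uninstall it first with pip uninstall triton, then install the correct version.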
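The two checks in this answer (is triton installed, and does the GPU meet the compute capability requirement) can be sketched as follows. Both function names are hypothetical helpers; the 7.0 threshold is taken from the answer above rather than verified against triton's own documentation.

```python
import importlib.util


def triton_installed() -> bool:
    """Return True if a 'triton' package can be found in this environment."""
    return importlib.util.find_spec("triton") is not None


def meets_min_compute_capability(major: int, minor: int) -> bool:
    """Return True if the GPU's compute capability is at least 7.0.

    Assumption: 7.0 is the minimum, as stated in the FAQ answer.
    """
    return (major, minor) >= (7, 0)
```

On a machine with PyTorch and a CUDA device, the `(major, minor)` pair can be obtained from `torch.cuda.get_device_capability()`. If `triton_installed()` returns False on supported hardware, the pip commands in the answer apply.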