OpenSource-Hub

VibeVoice

Framework

microsoft/VibeVoice

Open-source frontier voice AI models for TTS and ASR.

Overview

VibeVoice is a family of open-source frontier voice AI models from Microsoft, including Text-to-Speech and Automatic Speech Recognition models. It features long-form audio support, multi-speaker generation, and multilingual capabilities, with innovations in continuous speech tokenization and next-token diffusion.

README Preview

## 🎙️ VibeVoice: Open-Source Frontier Voice AI

[Project page](https://microsoft.github.io/VibeVoice) | [Hugging Face collection](https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f) | [OpenReview PDF](https://openreview.net/pdf?id=FihSkzyxdv) | [arXiv PDF](https://arxiv.org/pdf/2601.18184) | [Colab demo](https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/VibeVoice_colab.ipynb) | [VibeVoice-ASR Playground](https://aka.ms/vibevoice-asr) | [Trendshift](https://trendshift.io/repositories/15465)

### 📰 News

2026-03-06: 🚀 VibeVoice ASR is now part of a Transformers release! You can now use our speech recognition model directly through the Hugging Face Transformers library for seamless integration into your projects.

2026-01-21: 📣 We open-sourced VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. Try it in [Playground](https://aka.ms/vibevoice-asr).
- ⭐️ VibeVoice-ASR is natively multilingual, supporting over 50 languages; check the [supported languages](docs/vibevoice-asr.md#language-distribution) for details.
- 🔥 The VibeVoice-ASR [finetuning code](finetuning-asr/README.md) is now available!
- ⚡️ **vLLM inference** is now supported for faster inference; see [vllm-asr](docs/vibevoice-vllm-asr.md) for more details.
- 📑 [VibeVoice-ASR Technical Report](https://arxiv.org/pdf/2601.18184) is available.

2025-12-16: 📣 We added experimental speakers to VibeVoice-Realtime-0.5B for exploration, including multilingual voices in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) and 11 distinct English style voices. [Try it](docs/vibevoice-realtime-0.5b.md#optional-more-experimental-voices). More speaker types will be added over time.

2025-12-03: 📣 We open-sourced VibeVoice-Realtime-0.5B, a real-time text-to-speech model that supports streaming text input and robust long-form