@kyutai-labs.bsky.social
https://kyutai.org/ Open-Science AI Research Lab based in Paris
Available in PyTorch, MLX, on your iPhone, or in Rust for your server needs!
Project Page: kyutai.org/next/stt
OpenASR Leaderboard: huggingface.co/spaces/hf-au...
Our latest open-source speech-to-text model just claimed 1st place among streaming models and 5th place overall on the OpenASR leaderboard!
While all other models need the whole audio, ours delivers top-tier accuracy on streaming content.
Open, fast, and ready for production!
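To make the streaming point concrete, here is a toy sketch of a client feeding audio to a streaming transcriber in small chunks and printing partial text as it arrives. The `StreamingTranscriber` class is a hypothetical stand-in for illustration, not our actual API.

```python
# Illustrative only: StreamingTranscriber is a hypothetical stand-in, not Kyutai's actual API.
import numpy as np

class StreamingTranscriber:
    """Toy streaming STT: consumes audio chunks, emits text increments as they become available."""
    def __init__(self, sample_rate: int = 24_000):
        self.sample_rate = sample_rate
        self.buffered_seconds = 0.0

    def feed(self, chunk: np.ndarray) -> str:
        # A real model would update its internal state and decode new tokens here.
        self.buffered_seconds += len(chunk) / self.sample_rate
        return f"[partial transcript after {self.buffered_seconds:.1f}s] "

def transcribe_stream(audio: np.ndarray, sample_rate: int = 24_000, chunk_ms: int = 80):
    """Feed audio in small chunks so text is available while the speaker is still talking."""
    stt = StreamingTranscriber(sample_rate)
    chunk_size = sample_rate * chunk_ms // 1000
    for start in range(0, len(audio), chunk_size):
        text = stt.feed(audio[start:start + chunk_size])
        if text:
            print(text, end="", flush=True)

if __name__ == "__main__":
    transcribe_stream(np.zeros(24_000 * 2, dtype=np.float32))  # 2 seconds of dummy audio
```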
What's next? We strongly believe that the future of human-machine interaction lies in natural, full-duplex speech interactions, coupled with customization and extended abilities. Stay tuned for what's to come!
23.05.2025 10:14

The text LLM's response is passed to our TTS, conditioned on a 10s voice sample. We'll provide access to the voice cloning model in a controlled way. The TTS is also streaming *in text*, reducing latency by starting to speak even before the full text response is generated.
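As a rough illustration of why streaming in text cuts latency, here is a toy Python sketch: audio chunks start flowing after only a few tokens of the LLM's answer rather than after the full response. The generator and `stream_tts` function are hypothetical placeholders, not our actual TTS API.

```python
# Illustrative only: llm_tokens and stream_tts are hypothetical placeholders.
from typing import Iterator

def llm_tokens() -> Iterator[str]:
    """Stand-in for a text LLM streaming its answer token by token."""
    yield from ["Sure", ",", " the", " weather", " in", " Paris", " is", " sunny", "."]

def stream_tts(tokens: Iterator[str], voice_sample_path: str) -> Iterator[bytes]:
    """Toy text-streaming TTS: starts emitting audio after a few tokens, conditioned on a
    short reference voice sample, instead of waiting for the full sentence."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) >= 3:  # enough lookahead to synthesize a small audio chunk
            yield f"<audio for {''.join(buffer)!r} in voice {voice_sample_path}>".encode()
            buffer.clear()
    if buffer:
        yield f"<audio for {''.join(buffer)!r}>".encode()

for audio_chunk in stream_tts(llm_tokens(), voice_sample_path="reference_10s.wav"):
    print(audio_chunk)  # a real app would send this to the speaker immediately
```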
23.05.2025 10:14

Unmute's speech-to-text is streaming, accurate, and includes a semantic VAD that predicts whether you've actually finished speaking or if you're just pausing mid-sentence, meaning it's low-latency but doesn't interrupt you.
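A hedged sketch of the kind of turn-taking rule a semantic VAD enables. The probabilities and thresholds below are made up for illustration; they are not the model's actual values.

```python
# Illustrative only: p_end_of_turn would come from a semantic VAD model; values are made up.
def should_respond(p_end_of_turn: float, silence_ms: float) -> bool:
    """Turn-taking rule of thumb: respond quickly when the model is confident the user is done,
    but tolerate long mid-sentence pauses when it is not."""
    if p_end_of_turn > 0.9 and silence_ms > 150:
        return True            # confident end of turn: answer with low latency
    if p_end_of_turn < 0.3:
        return False           # likely a mid-sentence pause: keep listening
    return silence_ms > 1200   # ambiguous: fall back to a longer silence timeout

print(should_respond(0.95, 200))  # True: finished sentence, short pause
print(should_respond(0.10, 800))  # False: pausing mid-sentence, don't interrupt
```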
23.05.2025 10:14

"But what about Moshi?" While Moshi provides unmatched latency and naturalness, it doesn't yet match the abilities of text models such as function calling, stronger reasoning, and in-context learning. Unmute allows us to bring all of these from text directly to real-time voice conversations.
23.05.2025 10:14

Talk to unmute.sh, the most modular voice AI around. Empower any text LLM with voice, instantly, by wrapping it with our new speech-to-text and text-to-speech. Any personality, any voice. Interruptible, smart turn-taking. We'll open-source everything within the next few weeks.
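Conceptually, the wrapping looks like this. The classes below are toy placeholders standing in for the real speech-to-text, text-to-speech, and LLM components, not the Unmute codebase.

```python
# Illustrative glue only: these component classes are placeholders, not the Unmute codebase.
class STT:
    def transcribe(self, audio: bytes) -> str:
        return "what's the capital of france"

class TTS:
    def synthesize(self, text: str, voice: str) -> bytes:
        return f"<{voice} audio: {text}>".encode()

def voice_wrap(llm_reply, stt: STT, tts: TTS, voice: str = "default"):
    """Turn any text-in/text-out LLM into a voice agent: audio -> text -> LLM -> text -> audio."""
    def handle_turn(user_audio: bytes) -> bytes:
        return tts.synthesize(llm_reply(stt.transcribe(user_audio)), voice=voice)
    return handle_turn

agent = voice_wrap(lambda prompt: f"You asked: {prompt}. The answer is Paris.", STT(), TTS())
print(agent(b"<user audio>"))
```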
23.05.2025 10:14

Read more about Helium 1 and dactory on our blog: kyutai.org/2025/04/30/h...
Get the models on Hugging Face: huggingface.co/kyutai/heliu...
Try our pretraining data pipeline on GitHub: github.com/kyutai-labs/...
Thrilled to announce Helium 1, our new 2B-parameter LLM, now available alongside dactory, an open-source pipeline to reproduce its training dataset covering all 24 EU official languages. Helium sets new standards within its size class on European languages!
05.05.2025 10:39

If you have audio data with speaker-separated streams, head over to github.com/kyutai-labs/moshi-finetune and train your own Moshi! We have already seen nice extensions of Moshi like J-Moshi; we hope this release will allow more people to create their very own voice AI!
01.04.2025 15:47

Fine-tuning Moshi only takes a couple of hours and can be done on a single GPU thanks to LoRA. The codebase contains an example Colab notebook that demonstrates the simplicity and efficiency of the procedure; a minimal LoRA sketch follows the link below.
github.com/kyutai-labs/...
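For readers curious what the LoRA trick amounts to, here is a minimal PyTorch sketch of the idea, not the moshi-finetune code itself: the pretrained weights stay frozen and only two small low-rank matrices per layer are trained, which is why a single GPU suffices.

```python
# Minimal PyTorch sketch of the LoRA idea; not the moshi-finetune implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen Linear layer with a trainable low-rank update: W x + (alpha/r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                # the original weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)         # start as a no-op update
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # only the two small low-rank matrices are trained
```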
Have you enjoyed talking to Moshi and dreamt of making your own speech-to-speech chat experience? It's now possible with the moshi-finetune codebase! Plug in your own dataset and change the voice/tone/personality of Moshi. Here is an example after fine-tuning with only 20 hours of the DailyTalk dataset.
01.04.2025 15:47

If you want to work on cutting-edge research, join our non-profit AI lab in Paris.
Thanks to Iliad Group, CMA-CGM Group, Schmidt Sciences, and the open-source community.
Fully open-source
We're releasing a preprint, model weights, and a benchmark dataset for spoken visual question answering:
Preprint: arxiv.org/abs/2503.15633
Dataset: huggingface.co/datasets/kyu...
Model weights: huggingface.co/kyutai/moshi...
Inference code: github.com/kyutai-labs/...
How it works
MoshiVis builds on Moshi, our speech-to-speech LLM, now enhanced with vision.
206M lightweight parameters on top of a frozen Moshi give it the power to discuss images while still remaining real-time on consumer-grade hardware (see the adapter sketch after the links below).
Try it out: vis.moshi.chat
Blog post: kyutai.org/moshivis
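For intuition, here is a generic PyTorch sketch of the pattern: a small trainable adapter cross-attending from the frozen speech model's hidden states to precomputed image features. The dimensions and layout are illustrative; the actual MoshiVis architecture is described in the preprint.

```python
# Generic sketch of a lightweight adapter on a frozen backbone; not the actual MoshiVis code.
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Small cross-attention block: speech-token states attend to precomputed image features."""
    def __init__(self, dim: int = 512, image_dim: int = 1024, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(image_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        img = self.proj(image_feats)
        attended, _ = self.attn(query=hidden, key=img, value=img)
        return self.norm(hidden + attended)      # residual around the frozen backbone's states

adapter = VisionAdapter()
hidden = torch.randn(1, 50, 512)                 # hidden states from the frozen speech model
image_feats = torch.randn(1, 196, 1024)          # patch features from a frozen image encoder
print(adapter(hidden, image_feats).shape)        # torch.Size([1, 50, 512]); only the adapter trains
```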
Meet MoshiVis, the first open-source real-time speech model that can talk about images!
It sees, understands, and talks about images, naturally and out loud.
This opens up new applications, from audio description for the visually impaired to voice-based access to visual information.
Even Kavinsky can't break Hibiki! Just like Moshi, Hibiki is robust to extreme background conditions.
11.02.2025 16:11

Get the code on GitHub and the weights on Hugging Face, and try it out for yourself: github.com/kyutai-labs/...
07.02.2025 08:22

Hibiki's smaller alternative, Hibiki-M, runs on-device in real time. Hibiki-M was obtained by distilling the full model into a smaller version with only 1.7B parameters. On an iPhone 16 Pro, Hibiki-M runs in real time for more than a minute, as shown by Tom.
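For readers unfamiliar with distillation, here is a generic PyTorch sketch of a standard distillation objective used to compress a large teacher into a small student. This is illustrative only; it is not the exact Hibiki-M recipe, and the hyperparameters below are made up.

```python
# Generic knowledge-distillation sketch; not the exact Hibiki-M training recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T: float = 2.0, alpha: float = 0.5):
    """Mix cross-entropy on ground-truth tokens with a KL term pulling the small student's
    distribution toward the large teacher's (softened by temperature T)."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kl

student = torch.randn(4, 32000)                 # logits from the small student
teacher = torch.randn(4, 32000)                 # logits from the large teacher
targets = torch.randint(0, 32000, (4,))
print(distillation_loss(student, teacher, targets))
```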
07.02.2025 08:22

To train Hibiki, we generated bilingual simultaneous-interpretation data in which a word only appears in the target once it is predictable from the source. We developed a new method based on an off-the-shelf text translation system and a TTS system with constraints on word locations.
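Conceptually, the predictability constraint can be pictured as follows. Here `word_given_prefix_prob` is a hypothetical stand-in for the off-the-shelf translation system's word-level scores, and the threshold is illustrative; the actual data-generation procedure is more involved.

```python
# Conceptual sketch only: the scorer is a hypothetical stand-in for an MT system's word scores.
def earliest_emission_index(target_word: str, source_words: list, word_given_prefix_prob,
                            threshold: float = 0.5) -> int:
    """Return how many source words must be heard before target_word becomes predictable,
    i.e. before an interpreter could safely say it."""
    for i in range(1, len(source_words) + 1):
        if word_given_prefix_prob(target_word, source_words[:i]) >= threshold:
            return i
    return len(source_words)  # only predictable once the full source has been heard

# Toy scorer: a word is "predictable" once the source prefix already contains it.
toy_prob = lambda word, prefix: 1.0 if word.lower() in " ".join(prefix).lower() else 0.0
src = "je suis allé à Paris hier".split()
print(earliest_emission_index("Paris", src, toy_prob))  # 5: wait until "Paris" has been said
```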
07.02.2025 08:22

Based on objective and human evaluations, Hibiki outperforms previous systems in quality, naturalness, and speaker similarity, and approaches human interpreters.
Here is an example of a live conference interpretation.
Meet Hibiki, our simultaneous speech-to-speech translation model, currently supporting French-to-English.
Hibiki produces spoken and text translations of the input speech in real time, while preserving the speaker's voice and optimally adapting its pace based on the semantic content of the source speech.
Helium 2B running locally on an iPhone 16 Pro at ~28 tok/s, faster than you can read your loga lessons in French. All that thanks to mlx-swift with q4 quantization!
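On a Mac, the Python mlx-lm package offers the analogous workflow (the on-device demo itself uses mlx-swift). The quantized checkpoint path below is a placeholder for illustration, not an official repo name.

```python
# Desktop-Python analogue of the on-device demo using mlx-lm; the iPhone demo uses mlx-swift.
from mlx_lm import load, generate

# Load a 4-bit quantized checkpoint on Apple silicon (placeholder path, not an official repo).
model, tokenizer = load("path/to/helium-1-preview-2b-4bit-mlx")

prompt = "Explique la photosynthèse en une phrase."
print(generate(model, tokenizer, prompt=prompt, max_tokens=128))
```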
14.01.2025 16:38

We are looking forward to feedback from the community, which will help us drive the development of Helium and make it the best multilingual lightweight model. Thanks @hf.co for helping us with this release!
13.01.2025 17:51

We will also release the full model and a technical report, and we will open-source the code for training the model and for reproducing our dataset.
13.01.2025 17:51

Helium currently supports 6 languages (English, French, German, Italian, Portuguese, and Spanish) and will be extended to more languages shortly. Here is a summary of Helium's performance on multilingual benchmarks.
13.01.2025 17:50

Meet Helium-1 preview, our 2B multilingual LLM targeting edge and mobile devices, released under a CC-BY license. Start building with it today!
huggingface.co/kyutai/heliu...