
Kyutai

@kyutai-labs.bsky.social

https://kyutai.org/ Open-Science AI Research Lab based in Paris

510 Followers  |  4 Following  |  28 Posts  |  Joined: 18.11.2024

Latest posts by kyutai-labs.bsky.social on Bluesky

Available in PyTorch, MLX, on your iPhone, or in Rust for your server needs!
Project Page: kyutai.org/next/stt
OpenASR Leaderboard: huggingface.co/spaces/hf-au...

27.06.2025 10:31 · 👍 2 · 🔁 0 · 💬 0 · 📌 0
Post image

Our latest open-source speech-to-text model just claimed 1st place among streaming models and 5th place overall on the OpenASR leaderboard 🥇🎙️
While all other models need the whole audio, ours delivers top-tier accuracy on streaming content.
Open, fast, and ready for production!

27.06.2025 10:31 · 👍 4 · 🔁 3 · 💬 1 · 📌 0
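
A minimal sketch of trying the model from Python: the checkpoint id and the use of the generic transformers ASR pipeline are assumptions, so check the project page above for the exact model name and recommended usage.

```python
# Hedged sketch: the model id below is assumed, verify it on huggingface.co/kyutai.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="kyutai/stt-2.6b-en",  # hypothetical checkpoint name
)
print(asr("meeting_recording.wav")["text"])
```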

What's next? We strongly believe that the future of human-machine interaction lies in natural, full-duplex speech interactions, coupled with customization and extended abilities. Stay tuned for what's to come!

23.05.2025 10:14 · 👍 2 · 🔁 0 · 💬 0 · 📌 0
Video thumbnail

The text LLM's response is passed to our TTS, conditioned on a 10s voice sample. We'll provide access to the voice cloning model in a controlled way. The TTS is also streaming *in text*, reducing the latency by starting to speak even before the full text response is generated.

23.05.2025 10:14 · 👍 2 · 🔁 0 · 💬 1 · 📌 0
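
A simplified illustration of the streaming-in-text idea above: text is forwarded to the TTS in small chunks as the LLM generates it, so audio can start before the full reply exists. The `tts` object and its methods are hypothetical placeholders, not Kyutai's actual interface.

```python
# Illustrative only: `tts.feed_text` and `tts.flush` are hypothetical.
def speak_while_generating(llm_token_stream, tts):
    buffer = []
    for token in llm_token_stream:                 # tokens arrive as the LLM generates
        buffer.append(token)
        if token.endswith((".", "!", "?", ",")):   # flush at natural break points
            tts.feed_text("".join(buffer))         # audio starts before the reply is complete
            buffer.clear()
    if buffer:
        tts.feed_text("".join(buffer))
    tts.flush()                                    # finish the last audio chunk
```
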
Video thumbnail

Unmute's speech-to-text is streaming, accurate, and includes a semantic VAD that predicts whether you've actually finished speaking or if you're just pausing mid-sentence, meaning it's low-latency but doesn't interrupt you.

23.05.2025 10:14 · 👍 0 · 🔁 0 · 💬 1 · 📌 0
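
The semantic VAD can be thought of as a turn-taking gate that combines a model-predicted end-of-turn probability with elapsed silence, instead of a fixed silence timeout. A toy sketch with made-up thresholds, not Unmute's implementation:

```python
def should_take_turn(end_of_turn_prob: float, silence_ms: float) -> bool:
    """Toy decision rule: respond only when the user seems to be done."""
    if end_of_turn_prob > 0.85:    # semantically complete: respond quickly (low latency)
        return True
    if silence_ms > 2000:          # long-pause fallback, even if the model is unsure
        return True
    return False                   # probably a mid-sentence pause: keep listening
```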

"But what about Moshi?" While Moshi provides unmatched latency and naturalness, it doesn't yet match the abilities of text models such as function calling, stronger reasoning, and in-context learning. Unmute allows us to directly bring all of these from text to real-time voice conversations.

23.05.2025 10:14 · 👍 0 · 🔁 0 · 💬 1 · 📌 0
Video thumbnail

Talk to unmute.sh 🔊, the most modular voice AI around. Empower any text LLM with voice, instantly, by wrapping it with our new speech-to-text and text-to-speech. Any personality, any voice. Interruptible, smart turn-taking. We'll open-source everything within the next few weeks.

23.05.2025 10:14 · 👍 8 · 🔁 1 · 💬 2 · 📌 2
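
The modular design described above is a cascaded loop: streaming speech-to-text in front of any text LLM, streaming text-to-speech behind it. A minimal sketch with placeholder objects standing in for the actual Unmute components:

```python
# Placeholder interfaces; any text LLM can be dropped into the middle.
def voice_turn(audio_in, stt, llm, tts, personality_prompt):
    user_text = stt.transcribe(audio_in)              # streaming speech-to-text
    reply = llm.chat(system=personality_prompt,       # any text LLM, any personality
                     user=user_text)
    return tts.synthesize(reply)                      # streaming text-to-speech
```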

🧑‍💻 Read more about Helium 1 and dactory on our blog: kyutai.org/2025/04/30/h...
🤗 Get the models on HuggingFace: huggingface.co/kyutai/heliu...
📚 Try our pretraining data pipeline on GitHub: github.com/kyutai-labs/...

05.05.2025 10:39 · 👍 2 · 🔁 0 · 💬 0 · 📌 0
Post image

🚀 Thrilled to announce Helium 1, our new 2B-parameter LLM, now available alongside dactory, an open-source pipeline to reproduce its training dataset covering all 24 official EU languages. Helium sets new standards within its size class on European languages!

05.05.2025 10:39 · 👍 3 · 🔁 0 · 💬 1 · 📌 1
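
A hedged sketch of loading a Helium checkpoint from Python with transformers: the model id below is the preview checkpoint linked later in this feed, and transformers compatibility is assumed, so check the Hugging Face page for the exact usage.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kyutai/helium-1-preview-2b"   # preview id from the Hugging Face link in this feed
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "La capitale de l'Italie est"    # Helium targets European languages
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```
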
Preview
GitHub - kyutai-labs/moshi-finetune

If you have audio data with speaker-separated streams 🗣️🎙️🎤🤖, head over to github.com/kyutai-labs/moshi-finetune and train your own Moshi! We have already witnessed nice extensions of Moshi, like J-Moshi 🇯🇵; we hope this release will allow more people to create their very own voice AI!

01.04.2025 15:47 · 👍 2 · 🔁 0 · 💬 0 · 📌 0

Fine-tuning Moshi only takes a couple of hours and can be done on a single GPU thanks to LoRA ⚡. The codebase contains an example Colab notebook that demonstrates the simplicity and efficiency of the procedure 🎮.
🔎 github.com/kyutai-labs/...

01.04.2025 15:47 · 👍 3 · 🔁 0 · 💬 1 · 📌 0
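
A single GPU and a few hours are enough because LoRA keeps the pretrained weights frozen and trains only small low-rank matrices. A generic PyTorch sketch of the idea, not the moshi-finetune code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen Linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen projection plus the small trainable low-rank correction.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```
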
Video thumbnail

Have you enjoyed talking to 🟢 Moshi and dreamt of making your own speech-to-speech chat experience 🧑‍🔬🤖? It's now possible with the moshi-finetune codebase! Plug in your own dataset and change the voice/tone/personality of Moshi 💚🔌💿. Here is an example after fine-tuning with only 20 hours of the DailyTalk dataset. 🧵

01.04.2025 15:47 · 👍 6 · 🔁 1 · 💬 1 · 📌 2

If you want to work on cutting-edge research, join our non-profit AI lab in Paris 🇫🇷

Thanks to Iliad Group, CMA-CGM Group, Schmidt Sciences, and the open-source community.

21.03.2025 14:39 · 👍 2 · 🔁 0 · 💬 0 · 📌 0
Video thumbnail

🧰 Fully open-source

We're releasing a preprint, model weights and a benchmark dataset for spoken visual question answering:

📄 Preprint: arxiv.org/abs/2503.15633
🧠 Dataset: huggingface.co/datasets/kyu...
🧾 Model weights: huggingface.co/kyutai/moshi...
🧪 Inference code: github.com/kyutai-labs/...

21.03.2025 14:39 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
Video thumbnail

🧠 How it works

MoshiVis builds on Moshi, our speech-to-speech LLM, now enhanced with vision.

Just 206M lightweight parameters on top of a frozen Moshi give it the ability to discuss images while still running in real time on consumer-grade hardware.

21.03.2025 14:39 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
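
A toy sketch of the small-adapter-on-a-frozen-backbone recipe: a gated cross-attention block lets the frozen speech model's hidden states attend to image features, and only the adapter parameters are trained. Dimensions and layout are illustrative, not MoshiVis's actual architecture.

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Gated cross-attention from (frozen) LM hidden states to image features."""
    def __init__(self, dim: int = 1024, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))          # tanh(0) = 0: starts as a no-op

    def forward(self, hidden, image_feats):
        attended, _ = self.cross_attn(hidden, image_feats, image_feats)
        return hidden + torch.tanh(self.gate) * attended  # residual keeps the frozen behavior
```
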
Video thumbnail

Try it out 👉 vis.moshi.chat
Blog post 👉 kyutai.org/moshivis

21.03.2025 14:39 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
Video thumbnail

Meet MoshiVis 🎙️🖼️, the first open-source real-time speech model that can talk about images!

It sees, understands, and talks about images, naturally and out loud.

This opens up new applications, from audio description for the visually impaired to spoken access to visual information.

21.03.2025 14:39 · 👍 6 · 🔁 2 · 💬 1 · 📌 2
Video thumbnail

Even Kavinsky 🎧🪩 can't break Hibiki! Just like Moshi, Hibiki is robust to extreme background conditions 💥🔊.

11.02.2025 16:11 · 👍 8 · 🔁 4 · 💬 0 · 📌 1
Preview
GitHub - kyutai-labs/hibiki: Hibiki is a model for streaming speech translation (also known as simultaneous translation). Unlike offline translation, where one waits for the end of the source utterance to start translating...

Get the code on GitHub and the weights on Hugging Face, and try it out for yourself: github.com/kyutai-labs/...

07.02.2025 08:22 · 👍 4 · 🔁 0 · 💬 0 · 📌 0
Video thumbnail

Hibiki's smaller alternative, Hibiki-M, runs on-device in real time. Hibiki-M was obtained by distilling the full model into a smaller version with only 1.7B parameters. On an iPhone 16 Pro, Hibiki-M runs in real time for more than a minute, as shown by Tom.

07.02.2025 08:22 · 👍 2 · 🔁 1 · 💬 1 · 📌 0
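
Distilling the full model into a smaller one like Hibiki-M follows the standard recipe: the student is trained to match the teacher's output distribution in addition to the usual task loss. A generic sketch; the temperature and weighting are illustrative, not Hibiki's training setup.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-smoothed distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the usual next-token cross-entropy.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```
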
Post image

To train Hibiki, we generated bilingual simultaneous-interpretation data in which a word appears in the target only once it is predictable from the source. We developed a new method based on an off-the-shelf text translation system and a TTS system with constraints on word locations.

07.02.2025 08:22 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
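
A simplified sketch of the constraint described above: each target word is delayed until the source prefix heard so far makes it predictable, and the resulting positions then constrain where the TTS places the words. `is_predictable` stands in for the scoring done with the off-the-shelf translation system; all names are hypothetical.

```python
def schedule_target_words(source_words, target_words, is_predictable):
    """Return (n_source_words_heard, target_word) pairs for constrained TTS placement."""
    schedule, heard = [], 0
    for tgt in target_words:
        # Reveal more of the source until this target word becomes predictable.
        while heard < len(source_words) and not is_predictable(source_words[:heard], tgt):
            heard += 1
        schedule.append((heard, tgt))
    return schedule
```
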
Video thumbnail

Based on objective and human evaluations, Hibiki outperforms previous systems in quality, naturalness, and speaker similarity, and approaches human interpreters.
Here is an example of a live conference interpretation.

07.02.2025 08:22 · 👍 1 · 🔁 0 · 💬 1 · 📌 0
Video thumbnail

Meet Hibiki, our simultaneous speech-to-speech translation model, currently supporting 🇫🇷➡️🇬🇧.
Hibiki produces spoken and text translations of the input speech in real-time, while preserving the speaker's voice and optimally adapting its pace based on the semantic content of the source speech. 🧵

07.02.2025 08:22 · 👍 11 · 🔁 2 · 💬 1 · 📌 2
Video thumbnail

Helium 2B running locally on an iPhone 16 Pro at ~28 tok/s, faster than you can read your loga lessons in French 🚀 All that thanks to mlx-swift with q4 quantization!

14.01.2025 16:38 · 👍 1 · 🔁 1 · 💬 0 · 📌 1
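
A rough back-of-the-envelope check on why 4-bit quantization makes a 2B model fit comfortably on a phone (weights only, ignoring activations and the KV cache):

```python
params = 2e9                 # Helium 2B
bits_per_weight = 4          # q4 quantization
gb = params * bits_per_weight / 8 / 1e9
print(f"~{gb:.1f} GB of weights")   # ~1.0 GB of weights to keep in memory
```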

We are looking forward to feedback from the community, which will help us drive the development of Helium and make it the best multi-lingual lightweight model. Thanks to @hf.co for helping us with this release!

13.01.2025 17:51 · 👍 2 · 🔁 0 · 💬 0 · 📌 0

We will also release the full model and a technical report, and we will open-source the code for training the model and for reproducing our dataset.

13.01.2025 17:51 · 👍 4 · 🔁 0 · 💬 1 · 📌 0
Post image

Helium currently supports 6 languages (English, French, German, Italian, Portuguese and Spanish) and will be extended to more languages shortly. Here is a summary of Helium's performance on multilingual benchmarks.

13.01.2025 17:50 · 👍 2 · 🔁 0 · 💬 1 · 📌 0
Preview
kyutai/helium-1-preview-2b · Hugging Face

Meet Helium-1 preview, our 2B multi-lingual LLM, targeting edge and mobile devices, released under a CC-BY license. Start building with it today!
huggingface.co/kyutai/heliu...

13.01.2025 17:50 · 👍 16 · 🔁 5 · 💬 1 · 📌 5
