March Voice AI Meetup - Wednesday the 5th
lu.ma/ffpyl57n
@kwindla.bsky.social
Low, low, low latency. Daily.co and Pipecat.ai
The new model is GA today:
developers.googleblog.com/en/gemini-2-...
Gemini 2.0 Flash is competitive with GPT-4o on:
- TTFT,
- instruction following,
- function calling, and
- natural conversation dynamics.
GPT-4o was ahead on all of these attributes by a wide enough margin that using any other LLM for voice AI mostly didn't make sense. Now there's competition!
Source code is here:
github.com/pipecat-ai/p...
My favorite thing about this demo is that it's a really nice example of composite function calling.
Here are the function definitions. Gemini figures out solely from the argument descriptions how to find a conversation from "a few minutes ago"!
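The demo's actual definitions are in the linked repo; as a rough sketch of the idea, here's what description-driven tool schemas like these might look like in the Gemini function-declaration format (the names and wording are illustrative, not the demo's code):

```python
# Hypothetical tool schemas in the Gemini function-declaration style.
# The model decides when and how to chain these calls purely from the
# natural-language descriptions below.
get_conversations_schema = {
    "name": "get_saved_conversation_filenames",
    "description": (
        "List saved conversation files. Filenames are ISO-8601 timestamps "
        "recording when each conversation was saved."
    ),
    "parameters": {"type": "object", "properties": {}},
}

get_conversation_schema = {
    "name": "get_conversation_content",
    "description": (
        "Load the contents of a saved conversation. To find a conversation "
        "from 'a few minutes ago', first list the saved filenames, then pick "
        "the one whose timestamp is closest to the requested time."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "filename": {
                "type": "string",
                "description": "A filename returned by get_saved_conversation_filenames.",
            }
        },
        "required": ["filename"],
    },
}
```

The composite behavior falls out of the descriptions alone: the model chains the "list" call into the "load" call without any explicit orchestration code.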
Memory for voice AI agents (and composite function calling) ...
There are several ways to store (and later, retrieve) conversation state. One of the simplest is just to define a couple of functions and use your local filesystem!
Here, @chadbailey.net shows how to do that, using Gemini 2.0 Flash.
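A minimal sketch of that filesystem approach (the function names, directory layout, and JSON format here are my assumptions, not the actual code from the demo):

```python
import json
import pathlib
from datetime import datetime, timezone

# Hypothetical storage location for saved conversations.
STORAGE_DIR = pathlib.Path("conversations")


def save_conversation(messages: list[dict]) -> str:
    """Write the current message list to a timestamped JSON file.

    The timestamp-as-filename convention is what lets an LLM later reason
    about "a conversation from a few minutes ago".
    """
    STORAGE_DIR.mkdir(exist_ok=True)
    filename = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ") + ".json"
    (STORAGE_DIR / filename).write_text(json.dumps(messages, indent=2))
    return filename


def load_conversation(filename: str) -> list[dict]:
    """Read a previously saved message list back from disk."""
    return json.loads((STORAGE_DIR / filename).read_text())
```

Register those two functions as tools and the LLM can decide on its own when to persist or recall state.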
The Pion community is vibrant and welcoming, and Sean also wrote the definitive guide to WebRTC, "WebRTC For The Curious."
webrtcforthecurious.com
If you're interested in diving into WebRTC, read Sean's guide and check out the Pion code and forums.
Full video here:
www.youtube.com/watch?v=l_rT...
We talked about:
- WebRTC history.
- Why you probably need to use WebRTC instead of WebSockets for your voice AI application.
- How we see things evolving for WebRTC x multimodal AI.
- Embedded WebRTC, telephony, and surprising use cases.
Sean DuBois is one of my favorite people to talk to about WebRTC, audio and video, designing good libraries, and hacking in general.
Sean is the creator of Pion. Pion is an Open Source WebRTC implementation that is influential and very widely used (including at OpenAI, where Sean works).
Usually, when you work on a system like this, you never manage to write up all the lessons learned even for your own use, much less publish them in such an accessible paper. Kudos to the DeepSeek team.
lnkd.in/gz8SBvuM
Writing really, really optimized distributed systems code is very satisfying. I've written a lot of both GPU code and networking code over the years, so the overlap here makes me particularly happy!
But my favorite, favorite part is that they also wrote a section, "Suggestions on Hardware Design."
It would be fun to see this code, though the actual implementation is tightly coupled enough to the architecture of the H800 and their cluster design (the specifics of NVLink and InfiniBand) that it wouldn't be useful as an open source building block.
30.01.2025 20:46
My favorite part of the DeepSeek-V3 Technical Report is the stuff about the all-to-all communication kernels. (Mostly in section 3.2.2, "Efficient Implementation of Cross-Node All-to-All Communication.")
30.01.2025 20:46
Using Gemini search metadata in a voice AI application
Filipi added support in Pipecat for Google Gemini's `groundingMetadata`. This makes it easy to do things like:
- link to URLs
- log searches for observability
- use specific search result chunks for RAG
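As a rough illustration of consuming that metadata, here's a helper that pulls search queries and source links out of a `groundingMetadata` payload, shown on a plain dict shaped like the REST response (the field names follow Google's docs, but verify them against your SDK version):

```python
def extract_grounding(metadata: dict) -> dict:
    """Pull search queries and web sources out of a Gemini groundingMetadata dict."""
    queries = metadata.get("webSearchQueries", [])
    sources = [
        # Each grounding chunk carries the title and URI of a search result.
        {"title": chunk["web"].get("title"), "uri": chunk["web"].get("uri")}
        for chunk in metadata.get("groundingChunks", [])
        if "web" in chunk
    ]
    return {"queries": queries, "sources": sources}
```

The returned queries are handy for observability logs, and the source URIs are what you'd surface as links or feed into a RAG step.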
youtu.be/oL9w-3Hbag0
Voice AI programming with Gemini² and Cursor
Adrian built a Gemini voice + vision AI agent that writes software indirectly, collaborating with a human and with Gemini running inside Cursor. Really nice glimpse of the future (and nice example of a "multi-agent" architecture).
youtu.be/0VFZWZfU0vw
Pipecat 0.0.53 release is out today.
31 entries in the Changelog, including:
- Frame observers: for implementing loggers, debuggers, and pipeline tools
- Heartbeat frames: pipeline traversal timing, and warnings if system frames get blocked anywhere in the pipeline
github.com/pipecat-ai/p...
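The observer idea is simple: register an object that gets notified for every frame pushed through the pipeline. Here's a toy sketch of the pattern; this is not Pipecat's actual base class or callback signature, so see the linked changelog for the real API:

```python
class LoggingObserver:
    """Toy frame observer: logs every frame that moves through the pipeline.

    Hypothetical sketch; the real Pipecat observer base class and callback
    signature may differ.
    """

    def on_push_frame(self, source: str, frame: object) -> None:
        print(f"{source}: {type(frame).__name__}")


class Pipeline:
    """Minimal stand-in pipeline that notifies observers on every frame push."""

    def __init__(self, observers=None):
        self.observers = observers or []

    def push(self, source: str, frame: object) -> None:
        # Notify every registered observer before (conceptually) forwarding
        # the frame to the next processor in the pipeline.
        for obs in self.observers:
            obs.on_push_frame(source, frame)
```

The value of the pattern is that loggers and debuggers see every frame without being wired into the pipeline's processing path.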
Search is a built-in tool in the Gemini Multimodal Live API.
Here's an iOS starter project that shows:
- how to use the Gemini search built-in tool
- combining the built-in search with custom functions
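As a sketch, combining the two kinds of tools amounts to listing both in the session's tool config. The key names below follow Gemini's REST conventions and the `get_weather` function is purely illustrative; check the linked starter project for the real setup:

```python
# Hypothetical tool config for a Gemini session that enables both the
# built-in search tool and a custom function. Verify the exact key
# casing (snake_case vs camelCase) against your SDK version.
tools = [
    {"google_search": {}},  # built-in search grounding
    {
        "function_declarations": [
            {
                "name": "get_weather",  # illustrative custom function
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name."}
                    },
                    "required": ["city"],
                },
            }
        ]
    },
]
```

The model then picks per turn between searching the web and calling your function, based on the descriptions alone.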
Here's the code: github.com/pipecat-ai/p...
youtube.com/shorts/7jX7l...
Maslow's hierarchy of voice AI
u are here ⤵️
◻️◻️◻️◻️🟦◻️◻️◻️◻️
◻️◻️◻️🟦🟦🟦◻️◻️◻️
◻️◻️🟦🟦🟦🟦🟦◻️◻️
◻️🟦🟦🟦🟦🟦🟦🟦◻️
🟦🟦🟦🟦🟦🟦🟦🟦🟦
Network transport ▶️ Turn detection ▶️ Interruption handling ▶️ Natural voices ▶️ Tool use
www.youtube.com/watch?v=tAQW...
"Voice AI in 2025" panel recording from the Voice AI Meetup last night.
Thank you to panelists Karan Goel, Niamh Gavin, Shrestha Basu Mallick, and Swyx.
And thank you to Chroma for hosting the meetup in their fantastic office in SF.
www.youtube.com/live/B6zTwHh...
Christian built a voice AI assistant to control Spotify.
Tech:
- Google Gemini 2.0
- Pipecat
- Deepgram
- Cartesia
Code is here: github.com/pipecat-ai/s...
youtu.be/q6v-3BQem3Y
Gemini Multimodal Live API + iOS + WebRTC
Nice walk-through from Paul: youtu.be/nU3K8h_pkeQ
- set up a voice client in your iOS app
- specify WebSockets or WebRTC for network transport
- attach a delegate to handle lifecycle events (for example "connected", "LLM ready")
I listen to a fair amount of bluegrass, and alt country that overlaps with bluegrass, and I love a lot of mainstream country that overlaps with alt country!
Also, of course, the hip-hop and r&b of my youth, and hip-hop and r&b today that reminds me of the hip-hop and r&b of my youth.
Sunday morning listening ... and hacking.
12.01.2025 14:52
Today's reminder of how early we are in the generative AI / deep learning technology transition: I moved a moderately complex prompt to a different LLM and 150% of my evals broke. 150% because evals I didn't even have (but obviously needed) broke, too.
11.01.2025 17:48
Oh, wait. I take it back.
10.01.2025 23:15
They know what they're doing over there in Cupertino (and Shenzhen).
10.01.2025 23:13
The Simple Chatbot iOS example code is here:
github.com/pipecat-ai/p...
Clone the repo -> add your API keys to the .env file -> build -> run on your phone!
iOS + Gemini Multimodal Live + WebRTC
Filipi Fuchter added an iOS example to the Pipecat "Simple Chatbot" repo. With the Pipecat iOS SDK, you can build apps that use Gemini Multimodal Live and Gemini Flash with WebRTC, WebSockets, and HTTP networking.
I had a lot of fun talking to Eric Landau about the state of Voice AI at the end of 2024, what's coming in 2025, what the pain points are today if you're scaling voice AI agents in production, and โ of course โ the importance of data tooling and evals.
open.spotify.com/episode/5Fjj...