
@siddhant-arora.bsky.social

58 Followers  |  7 Following  |  14 Posts  |  Joined: 22.11.2024

Latest posts by siddhant-arora.bsky.social on Bluesky

🔗 Resources for ESPnet-SDS:
📂 Codebase (part of ESPnet): github.com/espnet/espnet
📖 README & User Guide: github.com/espnet/espne...
🎥 Demo Video: www.youtube.com/watch?v=kI_D...

17.03.2025 14:29 — 👍 1    🔁 1    💬 0    📌 0

This was joint work with my co-authors at @ltiatcmu.bsky.social, Sony Japan, and Hugging Face (@shinjiw.bsky.social @pengyf.bsky.social @jiatongs.bsky.social @wanchichen.bsky.social @shikharb.bsky.social @emonosuke.bsky.social @cromz22.bsky.social @reach-vb.hf.co @wavlab.bsky.social).

17.03.2025 14:29 — 👍 1    🔁 0    💬 1    📌 0

ESPnet-SDS provides:
✅ Unified Web UI with support for both cascaded & E2E models
✅ Real-time evaluation of latency, semantic coherence, audio quality & more
✅ Mechanism for collecting user feedback
✅ Open source with modular code → new systems can be incorporated easily!

17.03.2025 14:29 — 👍 0    🔁 0    💬 1    📌 0

Spoken dialogue systems (SDS) are everywhere, with many new systems emerging.
But evaluating and comparing them is challenging:
❌ No standardized interface: different frontends & backends
❌ Complex and inconsistent evaluation metrics
ESPnet-SDS aims to fix this!

17.03.2025 14:29 — 👍 0    🔁 0    💬 1    📌 0

New #NAACL2025 demo! Excited to introduce ESPnet-SDS, a new open-source toolkit for building unified web interfaces for both cascaded & end-to-end spoken dialogue systems, with real-time evaluation and more!
📜: arxiv.org/abs/2503.08533
Live Demo: huggingface.co/spaces/Siddh...

17.03.2025 14:29 — 👍 7    🔁 5    💬 1    📌 0

This work was done during my internship at Apple with Zhiyun Lu, Chung-Cheng Chiu, and @ruomingpang.bsky.social, along with co-authors at @ltiatcmu.bsky.social (@shinjiw.bsky.social @wavlab.bsky.social).

(9/9)

05.03.2025 16:03 — 👍 1    🔁 1    💬 0    📌 0
[WIP] Add code for training turn taking prediction model by siddhu001 Β· Pull Request #5948 Β· espnet/espnet What? Implemented code for training and inference of the encoder + classifier model. Added data preparation and evaluation for training the turn-taking model using the Switchboard dataset. Develop...

GPT-4o rarely backchannels or interrupts but shows strong turn-taking capabilities. More analysis of audio FMs' ability to understand and predict turn-taking events can be found in our full paper.

We’re open-sourcing our evaluation platform: github.com/espnet/espne...!

(8/9)

05.03.2025 16:03 — 👍 0    🔁 0    💬 1    📌 0

🤯 What did we find?

❌ Both systems fail to speak up when they should and do not give the user enough cues when they want to keep the conversation floor.
❌ Moshi interrupts too aggressively.
❌ Both systems rarely backchannel.
❌ User interruptions are poorly managed.

(7/9)

05.03.2025 16:03 — 👍 0    🔁 0    💬 1    📌 0

We train a causal judge model on real human-human conversations to predict turn-taking events. ⚡

Strong OOD generalization → a reliable proxy for human judgment!

No need for costly human judgments: our model judges the timing of turn-taking events automatically!

(6/9)
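As a rough illustration of how such a judge could replace human timing judgments: score each system-produced turn-taking event by whether the judge agrees it is well-timed. The per-frame probability interface, frame size, and threshold below are hypothetical assumptions for the sketch, not the paper's actual API.

```python
def judge_event_timing(judge_probs, event_frames, threshold=0.5):
    """Score a system's turn-taking events against a causal judge model.

    judge_probs[t] is the (hypothetical) judge's probability that a
    turn-taking event is appropriate at frame t; event_frames lists the
    frames where the system actually produced an event.  An event counts
    as well-timed when the judge agrees (probability >= threshold).
    Returns the fraction of well-timed events.
    """
    if not event_frames:
        return 0.0
    well_timed = sum(judge_probs[t] >= threshold for t in event_frames)
    return well_timed / len(event_frames)

# Toy judge trace over 6 frames; the system fired events at frames 2 and 5,
# exactly where the judge assigns high probability.
probs = [0.1, 0.2, 0.9, 0.4, 0.3, 0.8]
score = judge_event_timing(probs, [2, 5])
```

The same scheme extends to backchannels and interruptions by running one judge head per event type.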

05.03.2025 16:03 — 👍 0    🔁 0    💬 1    📌 0

Global metrics fail to capture when a turn-taking event happens!

Moshi generates overlapping speech, but is it helpful or disruptive to the natural flow of the conversation? 🤔

(5/9)

05.03.2025 16:03 — 👍 0    🔁 0    💬 1    📌 0

We compare an E2E system (Moshi, us.moshi.chat) and a cascaded system (github.com/huggingface/...) through a user study with global corpus-level statistics!

Moshi: small gaps, some overlap, but less than natural dialogue.
Cascaded: higher latency, minimal overlap.

(4/9)
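Corpus-level gap and overlap statistics of this kind can be sketched from timestamped turns. The (speaker, start, end) segment format and the toy dialogue below are illustrative assumptions, not code or data from the paper:

```python
def corpus_turn_stats(turns):
    """Compute corpus-level gap/overlap statistics from a list of
    (speaker, start_sec, end_sec) turns sorted by start time.

    At each change of speaker, a positive difference between the next
    turn's start and the previous turn's end is a gap; a negative
    difference is an overlap.
    """
    gaps, overlaps = [], []
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(turns, turns[1:]):
        if spk_a == spk_b:          # same speaker: a pause, not a transfer
            continue
        delta = start_b - end_a
        if delta >= 0:
            gaps.append(delta)
        else:
            overlaps.append(-delta)
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {"mean_gap": mean(gaps), "mean_overlap": mean(overlaps),
            "n_gaps": len(gaps), "n_overlaps": len(overlaps)}

# Toy dialogue: one clean floor transfer with a gap, one with an overlap.
turns = [("user", 0.0, 1.5), ("sys", 1.7, 3.0), ("user", 2.8, 4.0)]
stats = corpus_turn_stats(turns)
```

These are exactly the "global" numbers the next post argues are insufficient: they summarize how much gap and overlap occurred, but not whether each event was well-timed.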

05.03.2025 16:03 — 👍 0    🔁 0    💬 1    📌 0

Silence ≠ turn-switching cue! 🚫 Pauses are often longer than gaps in real conversations. 🤦‍♂️

Recent audio FMs claim conversational abilities, but there have been limited efforts to evaluate these models on their turn-taking capabilities.

(3/9)
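Why silence alone fails as a cue can be shown with a toy silence-threshold endpointer (the threshold and turn format here are made up for illustration): a long within-speaker pause gets flagged as a turn switch, while a quick real switch is missed.

```python
def naive_endpointer(turns, threshold=0.5):
    """Apply the common silence heuristic: predict a turn switch at every
    inter-turn silence longer than `threshold` seconds.

    `turns` is a list of (speaker, start_sec, end_sec) sorted by start.
    Returns (predicted_switch, actually_switched) pairs, one per silence.
    """
    results = []
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(turns, turns[1:]):
        silence = start_b - end_a
        if silence > 0:
            results.append((silence > threshold, spk_a != spk_b))
    return results

# A 0.8 s pause inside A's turn, then a quick 0.2 s gap at the A->B switch.
turns = [("A", 0.0, 1.0), ("A", 1.8, 3.0), ("B", 3.2, 4.0)]
results = naive_endpointer(turns)
# The long pause is wrongly flagged as a switch; the real switch is missed.
```

Both predictions are wrong in this toy example, which is the core failure mode when pauses run longer than gaps.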

05.03.2025 16:03 — 👍 0    🔁 0    💬 1    📌 0

💡 Why does turn-taking matter?

In human dialogue, we listen, speak, and backchannel in real time.

Similarly, an AI should know when to listen, speak, backchannel, and interrupt, convey to the user when it wants to keep the conversation floor, and handle user interruptions.

(2/9)

05.03.2025 16:03 — 👍 0    🔁 0    💬 1    📌 0

🚀 New #ICLR2025 Paper Alert! 🚀

Can Audio Foundation Models like Moshi and GPT-4o truly engage in natural conversations? 🗣️🔊

We benchmark their turn-taking abilities and uncover major gaps in conversational AI. 🧵👇

📜: arxiv.org/abs/2503.01174

05.03.2025 16:03 — 👍 9    🔁 6    💬 1    📌 0
