AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
Le Wang, Jun Wang, Feng Deng, Chen Zhang, Kun Gai, Di Zhang
AudioGen-Omni, a unified multimodal diffusion transformer, generates video-synchronized audio, speech, and songs using a novel joint training paradigm and AdaLN-based joint attention; achieves SOTA results on Text-to-Audio/Speech/Song tasks with improved efficiency.
04.08.2025 11:33
Dynamic Real-Time Ambisonics Order Adaptation for Immersive Networked Music Performances
Paolo Ostan, Carlo Centofanti, Mirco Pezzoli, Alberto Bernardini, Claudia Rinaldi, Fabio Antonacci
An adaptive higher-order Ambisonics strategy dynamically scales Ambisonics order based on network throughput, balancing immersion and reliability in Networked Music Performance scenarios.
04.08.2025 10:56
VR-PTOLEMAIC: A Virtual Environment for the Perceptual Testing of Spatial Audio Algorithms
Paolo Ostan, Francesca Del Gaudio, Federico Miotello, Mirco Pezzoli, Fabio Antonacci
VR-PTOLEMAIC, a virtual reality system, implements the MUSHRA methodology for perceptual evaluation of spatial audio algorithms; user feedback indicates positive user experience and immersivity.
04.08.2025 10:18
Wavelet-Based Time-Frequency Fingerprinting for Feature Extraction of Traditional Irish Music
Noah Shore
Continuous wavelet transform extracts spectral features; wavelet coherence analysis compares recorded audio spectrograms to synthetically generated Irish tunes derived from ABC notation for audio identification.
04.08.2025 09:41
Advancing Speech Quality Assessment Through Scientific Challenges and Open-source Activities
Wen-Chin Huang
Recent scientific challenges and open-source activities have stimulated the development of automatic Speech Quality Assessment (SQA) methods, leading to their increased use in generative AI research.
04.08.2025 09:03
Beamformed 360° Sound Maps: U-Net-Driven Acoustic Source Segmentation and Localization
Belman Jahir Rodriguez, Sergio F. Chevtchenko, Marcelo Herrera Martinez, Yeshwant Bethy, Saeed Afshar
A U-Net model segments beamformed audio maps for 360° acoustic source localization, trained on real-world drone recordings and GPS telemetry, generalizing across environments and improving angular precision.
04.08.2025 08:26
Ambisonics Super-Resolution Using A Waveform-Domain Neural Network
Ismael Nawfal, Symeon Delikaris Manias, Mehrez Souden, Juha Merimaa, Joshua Atkins, Elisabeth McMullin, Shadi Pirhosseinloo, Daniel Phillips
A fully convolutional time-domain audio neural network (Conv-TasNet) upscales First-Order Ambisonics (FOA) to Higher-Order Ambisonics (HOA), improving spatial accuracy; quantitative evaluations show a 0.6dB average positional mean squared error difference.
04.08.2025 07:48
Melody-Lyrics Matching with Contrastive Alignment Loss
Changhong Wang, Michel Olvera, Gaël Richard
Melody-Lyrics Matching (MLM) retrieves lyrics for a given melody using a self-supervised framework with contrastive alignment loss; introduces sylphone, a syllable-level lyric representation.
04.08.2025 07:11
Identifying Hearing Difficulty Moments in Conversational Audio
Jack Collins, Adrian Buzea, Chris Collier, Alejandro Ballesta Rosen, Julian Maclaren, Richard F. Lyon, Simon Carlile
Audio language models were shown to excel at continuously detecting utterances that identify Hearing Difficulty Moments in conversational audio, outperforming an ASR hotword heuristic and Wav2Vec fine-tuning.
01.08.2025 11:31
MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan
MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks, was introduced, with fine-grained captions and QA pairs; DATE, a novel metric, penalizes generic terms and rewards detailed descriptions.
01.08.2025 10:53
"I made this (sort of)": Negotiating authorship, confronting fraudulence, and exploring new musical spaces with prompt-based AI music generation
Bob L. T. Sturm
Two music albums using prompt-based AI music generation were created; an LLM interview explored authorship, musical identity, and new musical spaces.
01.08.2025 10:16
CUHK-EE Systems for the vTAD Challenge at NCMMSC 2025
Aemon Yat Fei Chiu, Jingyu Li, Yusheng Tian, Guangyan Zhang, Tan Lee
WavLM-Large embeddings with attentive statistical pooling and Diff-Net variants (FFN, SE-ResFFN) were used for Voice Timbre Attribute Detection (vTAD); WavLM-Large+FFN generalized better to unseen speakers, achieving 77.96%.
01.08.2025 09:38
Feature Importance across Domains for Improving Non-Intrusive Speech Intelligibility Prediction in Hearing Aids
Ryandhimas E. Zezario, Sabato M. Siniscalchi, Fei Chen, Hsin-Min Wang, Yu Tsao
Feature Importance across Domains (FiDo) estimates feature importance on spectral, time-domain acoustic features, and latent Whisper representations; FiDo was incorporated into MBI-Net+, reducing RMSE by 7.62%.
01.08.2025 09:01
Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models
Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Jiachen Lian, Tingle Li, Hung-yi Lee
Full-Duplex-Bench v1.5, a benchmark simulating overlap scenarios, was introduced; benchmarking five agents revealed repair-first rapid yielding vs. continuity-first sustained flow strategies, with scenario-dependent performance trends.
01.08.2025 08:23
Balancing Information Preservation and Disentanglement in Self-Supervised Music Representation Learning
Julia Wilkins, Sivan Ding, Magdalena Fuentes, Juan Pablo Bello
A multi-view SSL framework combining contrastive and reconstructive objectives was proposed for disentangling music audio representations; reconstruction and contrastive strategies complement each other for attribute disentanglement.
01.08.2025 07:46
Exploring Dynamic Parameters for Vietnamese Gender-Independent ASR
Sotheara Leang (CADT, M-PSI), Éric Castelli (M-PSI), Dominique Vaufreydaz (M-PSI), Sethserey Sam (CADT)
Dynamic SSCF-based polar parameters and SSCF0 pseudo-feature were combined with MFCCs in Vietnamese ASR; the proposed parameters reduced word error rates and showed greater gender independence.
01.08.2025 07:08
Next Tokens Denoising for Speech Synthesis
Yanqing Liu, Ruiqing Xue, Chong Zhang, Yufei Liu, Gang Wang, Bohan Li, Yao Qian, Lei He, Shujie Liu, Sheng Zhao
Dragon-FM, a text-to-speech model, unifies autoregressive (AR) modeling and flow matching: AR modeling operates across chunks and flow matching within chunks, enabling KV-cache use and future-context use.
31.07.2025 11:34
A k-space approach to modeling multi-channel parametric array loudspeaker systems
Tao Zhuang, Longbiao He, Feng Niu, Jia-Xin Zhong, Jing Lu
A k-space approach models multi-channel parametric array loudspeaker systems, solving the linear ultrasound field and computing the quasilinear audio sound field in k-space; achieves high computational and memory efficiency.
31.07.2025 11:00
Adaptive Duration Model for Text Speech Alignment
Junjie Cao
A novel duration prediction framework gives compromising phoneme-level duration distribution; improves alignment accuracy by ~11% and enhances zero-shot TTS model robustness.
31.07.2025 10:27
Modeling Multi-Level Hearing Loss for Speech Intelligibility Prediction
Xiajie Zhou, Candy Olivia Mawalim, Masashi Unoki
A speech intelligibility prediction method simulates auditory degradations by broadening cochlear filters and applying low-pass modulation filtering; a Vision Transformer integrates STM maps and NCC embeddings, outperforming HASPI v2.
31.07.2025 09:54
The Risks and Detection of Overestimated Privacy Protection in Voice Anonymisation
Michele Panariello, Sarina Meyer, Pierre Champion, Xiaoxiao Miao, Massimiliano Todisco, Ngoc Thang Vu, Nicholas Evans
Overestimated privacy protection in voice anonymisation is demonstrated; performance is overestimated when the verification system is poorly trained, with mismatched data leading to exaggerated performance reports.
31.07.2025 09:20
A Two-Step Learning Framework for Enhancing Sound Event Localization and Detection
Hogeon Yu
A two-step learning framework with track-wise reordering maintains temporal consistency; it trains SED and DoA networks separately, then fuses their features to enhance SELD, improving event classification and localization.
31.07.2025 08:47
Quantum-Inspired Audio Unlearning: Towards Privacy-Preserving Voice Biometrics
Shreyansh Pathak, Sonu Shreshtha, Richa Singh, Mayank Vatsa
QPAudioEraser, a quantum-inspired audio unlearning framework, erases individual-specific voice signatures by weight initialization using destructive interference, superposition-based label transformations, uncertainty-maximizing quantum loss function, and entanglement-inspired mixing.
31.07.2025 08:14
Tiny Noise-Robust Voice Activity Detector for Voice Assistants
Hamed Jafarzadeh Asl, Mahsa Ghazvini Nejad, Amin Edraki, Masoud Asgharian, Vahid Partovi Nia
A tiny noise-robust VAD combines a lightweight VAD with data pre- and post-processing; it enhances accuracy in noisy environments without larger models or fine-tuning, and also improves clean speech detection.
31.07.2025 07:40
Scaling and Distilling Transformer Models for sEMG
Nicholas Mehlman, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Kelvin Niu, Alexander H. Miller, Shagun Sodhani
Vanilla transformer models were effectively scaled up to 110M parameters on sEMG data, improving cross-user performance; they were then distilled into 50x smaller models with minimal performance loss.
31.07.2025 07:07
Whilter: A Whisper-based Data Filter for "In-the-Wild" Speech Corpora Using Utterance-level Multi-Task Classification
William Ravenscroft, George Close, Kit Bower-Morris, Jamie Stacey, Dmitry Sityaev, Kris Y. Hong
Whilter, a Whisper-based model, uses utterance-level multi-task classification to filter undesirable features from in-the-wild speech corpora, achieving F1 scores above 85.
30.07.2025 11:08
SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods
Wen Huang, Yanmei Gu, Zhiming Wang, Huijia Zhu, Yanmin Qian
SpeechFake, a large-scale multilingual speech deepfake dataset, contains over 3 million samples generated using 40 speech synthesis tools; baseline detection models show strong performance on seen and unseen test sets.
30.07.2025 10:08
Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations
Teng (Aleksandra) Ma, Sile Yin, Li-Chia Yang, Shuo Zhang
RAVEN, a real-time audio-visual speech enhancement system, isolates on-screen target speakers using visual embeddings from AVSR and ASD models; concatenating embeddings improves performance in low-SNR, multi-speaker environments.
30.07.2025 09:08
Relationship between objective and subjective perceptual measures of speech in individuals with head and neck cancer
Bence Mark Halpern, Thomas Tienkamp, Teja Rebernik, Rob J. J. H. van Son, Martijn Wieling, Defne Abur, Tomoki Toda
Subjective speech assessments of HNC patients correlated with objective acoustic measures; intelligibility, articulation, and voice quality showed strong correlations; a single intelligibility measure may be sufficient for clinical monitoring.
30.07.2025 08:08
Combolutional Neural Networks
Cameron Churchwell, Minje Kim, Paris Smaragdis
Combolutional layers, which combine learned-delay IIR comb filters with fused envelope detectors, extract harmonic features in the time domain for audio tasks; they outperform convolutional layers in piano transcription, speaker classification, and key detection.
30.07.2025 07:07