BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis
Jingyuan Xing, Mingru Yang, Zhipeng Li, Xiaofen Xing, Xiangmin Xu
BridgeCode is a dual speech representation paradigm for autoregressive zero-shot TTS that predicts sparse tokens and reconstructs rich continuous features; reduces AR iterations and enhances naturalness, intelligibility, and synthesis speed.
14.10.2025 11:52
Automatic Music Sample Identification with Multi-Track Contrastive Learning
Alain Riou, Joan Serrà, Yuki Mitsufuji
A self-supervised learning approach for automatic music sample identification uses a multi-track dataset to create positive pairs of artificial mixes and a novel contrastive learning objective; outperforms previous state-of-the-art baselines.
14.10.2025 11:41
ILD-VIT: A Unified Vision Transformer Architecture for Detection of Interstitial Lung Disease from Respiratory Sounds
Soubhagya Ranjan Hota, Arka Roy, Udit Satija
ILD-VIT, a vision transformer framework, detects Interstitial Lung Disease from respiratory sound recordings; uses mel spectrogram image patches for classification and achieves 84.86% accuracy, sensitivity, and specificity.
14.10.2025 11:30
Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning
Kuan-Yi Lee, Tsung-En Lin, Hung-Yi Lee
Audio-Maestro is a tool-augmented audio reasoning framework that enables audio-language models to autonomously call external tools and integrate timestamped outputs, improving performance by analyzing and interpreting audio signals through specialized tools.
14.10.2025 11:19
Dynamically Slimmable Speech Enhancement Network with Metric-Guided Training
Haixin Zhao, Kaixuan Yang, Nilesh Madhu
A dynamically slimmable network (DSN) for speech enhancement adaptively governs dynamic parts at a frame-wise resolution, with metric-guided training (MGT) guiding the policy module; achieves comparable performance with reduced complexity.
14.10.2025 11:08
Phase Aware Ear-Conditioned Learning for Multi-Channel Binaural Speaker Separation
Ruben Johnson Robert Jeremiah, Peyman Goli, Steven van de Par
PEASE-8, a Phase-aware Ear-conditioned speaker Separation network, introduces a raw-STFT input to the early decoder layer for improved reconstruction; delivers strong separation and intelligibility in reverberant environments.
14.10.2025 10:57
Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap
KiHyun Nam, Jongmin Choi, Hyeongkeun Lee, Jungwoo Heo, Joon Son Chung
Diffusion-Link, a diffusion-based modality-bridging module, generatively maps audio embeddings into the text-embedding distribution; reduces the audio-text modality gap and achieves state-of-the-art performance on Automatic Audio Captioning.
14.10.2025 10:45
Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker
Cheng Gong, Chunyu Qiang, Tianrui Wang, Yu Jiang, Yuheng Lu, Ruihao Jing, Xiaoxiao Miao, Xiaolei Zhang, Longbiao Wang, Jianwu Dang
EMM-TTS is a two-stage cross-lingual emotional speech synthesis framework based on perturbed self-supervised learning representations, explicitly encoding prosodic cues and restoring timbre; uses Speaker Consistency Loss and Speaker-Emotion Adaptive Layer Normalization.
14.10.2025 10:34
VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
Jiliang Hu, Wenfu Wang, Zuchao Li, Chenxing Li, Yiyang Zhao, Hanzhao Li, Liqiang Zhang, Meng Yu, Dong Yu
VCB Bench is a Chinese benchmark built on real human speech for evaluating audio-grounded large language model conversational agents, assessing instruction following, knowledge understanding, and robustness.
14.10.2025 10:23
MSRBench: A Benchmarking Dataset for Music Source Restoration
Yongyi Zang, Jiarui Hai, Wanying Ge, Qiuqiang Kong, Zheqi Dai, Helin Wang, Yuki Mitsufuji, Mark D. Plumbley
MSRBench is a benchmark for Music Source Restoration, containing raw stem-mixture pairs across eight instrument classes; mixtures are produced by professional mixing engineers & augmented with real-world degradations.
14.10.2025 10:12
Unify Variables in Neural Scaling Laws for General Audio Representations via Embedding Effective Rank
Xuyao Deng, Yanjie Sun, Yong Dou, Kele Xu
RankMe, embedding effective rank, is used as a unifying metric to study scaling laws for general audio representations, revealing a power-law relationship between RankMe and representation quality.
14.10.2025 10:01
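A minimal sketch of the effective-rank quantity the post above refers to, assuming the commonly used RankMe definition (exponential of the Shannon entropy of the normalized singular values). The function name `effective_rank`, the `eps` value, and the choice not to center the embeddings are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def effective_rank(embeddings: np.ndarray, eps: float = 1e-7) -> float:
    """Effective rank of an (n_samples, dim) embedding matrix."""
    # Singular values of the embedding matrix (no centering; an assumption).
    singular_values = np.linalg.svd(embeddings, compute_uv=False)
    # Normalize the singular values into a probability-like distribution.
    p = singular_values / singular_values.sum() + eps
    # Exponential of the Shannon entropy of that distribution.
    return float(np.exp(-np.sum(p * np.log(p))))
```

An embedding matrix whose singular values are all equal scores close to its full dimension, while a rank-one matrix scores close to 1.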
Perceptual Compensation of Ambisonics Recordings for Reproduction in Room
Ali Fallah, Shun Nakamura, Steven van de Par
A perceptually-motivated Ambisonics recording and rendering method compensates for playback room reverberation by spectrally and spatially compensating direct and reverberant sound field components; preserves auditory cues like DOA and IC.
14.10.2025 09:50
FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec
Yurii Halychanskyi, Cameron Churchwell, Yutong Wen, Volodymyr Kindratenko
FAC-FACodec is an accent conversion framework that provides an explicit, user-controllable parameter for accent modification to target pronunciation while preserving suprasegmental cues; performance is comparable to recent systems, with stronger speaker identity preservation.
14.10.2025 09:39
ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
ParsVoice, a large-scale Persian speech corpus for TTS, features 3,526 hours of speech from 2,000 audiobooks, filtered to 1,804 hours of high-quality data with 470+ speakers; publicly available.
14.10.2025 09:28
Dual Data Scaling for Robust Two-Stage User-Defined Keyword Spotting
Zhiqi Ai, Han Cheng, Yuxin Wang, Shiyi Mu, Shugong Xu, Yongjin Zhou
DS-KWS, a two-stage framework for user-defined keyword spotting, uses a dual data scaling strategy to strengthen the acoustic model and enhance distinction of confusable words; achieves 6.13% EER and 97.85% AUC on LibriPhrase Hard.
14.10.2025 09:16
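For readers unfamiliar with the EER figure quoted in the post above, here is a rough sketch of how an equal error rate can be approximated from detection scores. The function name `equal_error_rate` and the threshold sweep over observed scores are illustrative assumptions; this is not the evaluation code used by the authors.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER from detection scores and binary labels (1 = target)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Candidate thresholds: every distinct observed score.
    thresholds = np.unique(scores)
    # False acceptance rate: non-targets scoring at or above the threshold.
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    # False rejection rate: targets scoring below the threshold.
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    # The EER sits where the two curves cross; take the closest sampled point.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)
```

With few trials the crossing point is only sampled coarsely, so this is an approximation rather than an interpolated value.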
Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR
Ling Sun, Charlotte Zhu, Shuju Shi
Proficiency-aware multitask learning (ASR with proficiency classification) and targeted augmentation (spectrogram masking) are proposed to improve L2 ASR, reducing WER and narrowing proficiency gaps.
14.10.2025 09:05
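The spectrogram masking mentioned in the post above can be pictured with a generic SpecAugment-style sketch. The function name `mask_spectrogram`, the mask widths, and the choice of one frequency band plus one time span are assumptions for illustration; the paper's actual masking policy may differ.

```python
import numpy as np

def mask_spectrogram(spec, max_freq_width=8, max_time_width=20, seed=None):
    """Zero out one random frequency band and one random time span."""
    rng = np.random.default_rng(seed)
    # Work on a copy so the caller's array is left untouched.
    masked = np.array(spec, dtype=float, copy=True)
    n_freq, n_time = masked.shape
    # Frequency mask: a band of random width at a random position.
    f_width = int(rng.integers(0, min(max_freq_width, n_freq) + 1))
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    masked[f_start:f_start + f_width, :] = 0.0
    # Time mask: a span of random width at a random position.
    t_width = int(rng.integers(0, min(max_time_width, n_time) + 1))
    t_start = int(rng.integers(0, n_time - t_width + 1))
    masked[:, t_start:t_start + t_width] = 0.0
    return masked
```

The input is expected to have shape (frequency bins, time frames); a width of zero is allowed, in which case that mask is skipped.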
SS-DPPN: A self-supervised dual-path foundation model for the generalizable cardiac audio representation
Ummy Maria Muna, Md Mehedi Hasan Shawon, Md Jobayer, Sumaiya Akter, Md Rakibul Hasan, Md. Golam Rabiul Alam
SS-DPPN, a self-supervised dual-path prototypical network, is proposed as a foundation model for cardiac audio representation & classification; achieves state-of-the-art performance on four benchmarks, generalizes across lung sounds and heart rate.
14.10.2025 08:54
LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation
Jun Chen, Shichao Hu, Jiuxin Lin, Wenjie Li, Zihan Zhang, Xingchen Li, JinJiang Liu, Longshuai Xiao, Chao Weng, Lei Xie, Zhiyong Wu
LSZone, a lightweight spatial information modeling architecture for in-car multi-zone speech separation, uses a spatial information extraction-compression module and an extremely lightweight Conv-GRU crossband-narrowband processing module for real-time performance.
14.10.2025 08:43
A Machine Learning Approach for MIDI to Guitar Tablature Conversion
Maximos Kaliakatsos-Papakostas, Gregoris Bastas, Dimos Makris, Dorien Herremans, Vassilis Katsouros, Petros Maragos
A machine learning method assigns guitar tablature notation to MIDI, considering finger stretch and standard tuning; training with augmented data improves performance.
14.10.2025 08:32
MARS-Sep: Multimodal-Aligned Reinforced Sound Separation
Zihan Zhang, Xize Cheng, Zhennan Jiang, Dongjie Fu, Jingyuan Chen, Zhou Zhao, Tao Jin
MARS-Sep, a reinforcement learning framework for sound separation, uses a factorized Beta mask policy optimized by a clipped trust-region surrogate with entropy regularization & group-relative advantage normalization; multimodal rewards incentivize semantic consistency.
14.10.2025 08:21
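As a rough illustration of two ingredients named in the post above, the sketch below shows a generic PPO-style clipped surrogate combined with group-relative advantage normalization. The function names, the `clip_eps` value, and the omission of the Beta mask policy and the entropy term are all simplifying assumptions; the post does not give the paper's exact objective.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group to zero mean, unit scale."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over samples."""
    # Probability ratio between the updated and the sampling policy.
    ratio = np.exp(np.asarray(log_prob_new) - np.asarray(log_prob_old))
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    # Clipping removes the incentive to move the ratio far from 1.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage, which is zero after group normalization.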
Knowledge-Decoupled Functionally Invariant Path with Synthetic Personal Data for Personalized ASR
Yue Gu, Zhihao Du, Ying Shi, Jiqing Han, Yongjun He
KDFIP, a knowledge-decoupled functionally invariant path framework, integrates a gated parameter-isolation strategy into FIP for personalized ASR, separating generic and personalized knowledge; improves performance by 29.38%.
14.10.2025 08:10
MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations
Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Xintong Hu, Yu Zhang, Li Tang, Rui Yang, Han Wang, Zongbao Zhang, Yuhan Wang, Yixuan Chen, Hankun Xu, Ke Xu, Pengfei Fan, Zhetao Chen, Yanhao Yu, Qiange Huang, Fei Wu, Zhou Zhao
MRSAudio, a large-scale multimodal spatial audio dataset, includes binaural & ambisonic audio, video, motion trajectories, and fine-grained annotations for diverse scenarios; establishes five foundational tasks including spatialization and generation.
14.10.2025 07:59
ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis
Stephen Ni-Hahn, Chao Péter Yang, Mingchen Ma, Cynthia Rudin, Simon Mak, Yue Jiang
ProGress, a generative music framework, incorporates Schenkerian analysis with a diffusion modeling framework to enhance structural cohesion; adaptations of the DiGress model, phrase fusion, and user control improve musical compositions.
14.10.2025 07:47
Peransformer: Improving Low-informed Expressive Performance Rendering with Score-aware Discriminator
Xian He, Wei Zeng, Ye Wang
Peransformer incorporates a score-aware discriminator, trained on a note-to-note aligned MIDI dataset, to improve low-informed Expressive Performance Rendering; validated by subjective evaluations and extended automatic evaluation metrics.
14.10.2025 07:36
Matchmaker: An Open-source Library for Real-time Piano Score Following and Systematic Evaluation
Jiyun Park, Carlos Cancino-Chacón, Suhit Chiruthapudi, Juhan Nam
Matchmaker, an open-source Python library, systematically compares real-time music alignment methods using music representations and alignment algorithms, evaluated on solo piano music datasets with comprehensive metrics.
14.10.2025 07:25
Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model
Chung-Soo Ahn, Rajib Rana, Sunil Sivadas, Carlos Busso, Jagath C. Rajapakse
A data augmentation framework for SER uses cross-modal information transfer and mutual information regularization to improve the quality of generated data and expand the scope to multimodal inputs; tested on IEMOCAP, MSP-IMPROV & MSP-Podcast.
14.10.2025 07:14
Universal Discrete-Domain Speech Enhancement
Fei Liu, Yang Ai, Ye-Xin Lu, Rui-Chen Zheng, Hui-Peng Du, Zhen-Hua Ling
UDSE, a Universal Discrete-domain SE model, redefines speech enhancement as a discrete-domain classification task, predicting clean discrete tokens quantized by a pre-trained neural speech codec; shows superior universality.
14.10.2025 07:03
Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion
Ahmed Adel Attia, Jing Liu, Carol Espy Wilson
A framework using speech inversion as an auxiliary task and cross-attention to integrate predicted articulatory features into ASR models improved performance on LibriSpeech, particularly in low-resource conditions.
13.10.2025 07:29
Evaluating Hallucinations in Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions
Hansol Park, Hoseong Ahn, Junwon Moon, Yejin Lee, Kyuhong Shim
RePOPE-Spk, an audio-augmented benchmark, reveals that spoken queries under diverse acoustic conditions escalate hallucinations in multimodal large language models, increasing error rates by 3%.
13.10.2025 07:16
LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection
Benjamin Shiue-Hal Chou, Purvish Jajal, Nick John Eliopoulos, James C. Davis, George K. Thiruvathukal, Kristen Yeon-Ji Yun, Yung-Hsiang Lu
LadderSym, a Transformer-based method for music error detection, uses a two-stream encoder with inter-stream alignment and multimodal symbolic scores as decoder prompts, doubling F1 for missed notes on MAESTRO-E.
13.10.2025 07:03