Multimodal Speech Separation with Feedback Architecture


Full details

Bibliographic details
Main author: Qu, Yanren
Other authors: Faculty of Information Technology, University of Jyväskylä
Material type: Master's thesis (pro gradu)
Language: English
Published: 2025
Subjects:
Links: https://jyx.jyu.fi/handle/123456789/102963
Description
Summary: This thesis tackles the high computational cost and output inconsistency of processing long audio sequences in multimodal speech separation. We introduce the Self-Feedback RE-Sepformer (SFRS), an architecture that integrates an RNN-inspired incremental inference mechanism with an RE-Sepformer backbone. SFRS processes non-overlapping audio chunks, using propagated hidden states and previous masks as feedback, guided by strategically fused textual embeddings. Experiments on TextrolMix show that SFRS achieves 10.8 dB SI-SDRi, outperforming an LLM-TSE baseline. SFRS also yields substantial GPU memory savings, especially during inference (~7.2 GB vs. ~27.5 GB for a comparable non-incremental approach) and in training (~20.2 GB vs. ~33.6 GB). While full convergence was limited by computational constraints, SFRS effectively mitigates the burden of long-sequence processing, reduces memory usage, and improves output consistency, offering a promising path toward efficient, context-aware multimodal speech separation.
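
The summary describes the core mechanism only at a high level: chunk-wise processing in which a propagated hidden state and the previous chunk's estimated mask are fed back as conditioning, together with a fused textual embedding. The sketch below illustrates that feedback loop in PyTorch. It is a minimal, hypothetical reconstruction, not the thesis implementation: the names (`FeedbackSeparator`, `separate`), the dimensions, and the GRU standing in for the RE-Sepformer backbone are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class FeedbackSeparator(nn.Module):
    """Toy chunk-wise separator with RNN-style self-feedback (illustrative only).

    A GRU carries the hidden state across chunks; the previous chunk's mask
    and a text embedding condition the current chunk, loosely mirroring the
    feedback mechanism the abstract describes.
    """

    def __init__(self, feat_dim=128, text_dim=64, hidden_dim=128):
        super().__init__()
        # Input = mixture features + previous mask + fused text embedding.
        self.proj = nn.Linear(feat_dim + feat_dim + text_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden_dim, feat_dim), nn.Sigmoid())

    def forward(self, chunk, prev_mask, text_emb, hidden):
        # chunk, prev_mask: (B, T, feat_dim); text_emb: (B, text_dim)
        text = text_emb.unsqueeze(1).expand(-1, chunk.size(1), -1)
        x = self.proj(torch.cat([chunk, prev_mask, text], dim=-1))
        out, hidden = self.rnn(x, hidden)
        return self.mask_head(out), hidden


def separate(model, mixture, text_emb, chunk_len=100):
    """Incremental inference over non-overlapping chunks."""
    B, T, F = mixture.shape
    hidden = None
    prev_mask = torch.zeros(B, chunk_len, F)  # neutral feedback for the first chunk
    masks = []
    for start in range(0, T, chunk_len):
        chunk = mixture[:, start:start + chunk_len]
        if chunk.size(1) < chunk_len:  # zero-pad the final short chunk
            chunk = nn.functional.pad(chunk, (0, 0, 0, chunk_len - chunk.size(1)))
        mask, hidden = model(chunk, prev_mask, text_emb, hidden)
        prev_mask = mask.detach()  # feed the mask forward, not its gradient
        masks.append(mask)
    return torch.cat(masks, dim=1)[:, :T] * mixture


if __name__ == "__main__":
    model = FeedbackSeparator()
    mix = torch.randn(2, 350, 128)  # (batch, frames, features)
    txt = torch.randn(2, 64)        # fused textual embedding
    print(separate(model, mix, txt).shape)  # torch.Size([2, 350, 128])
```

Because each iteration touches only one fixed-length chunk while context flows through `hidden` and `prev_mask`, peak activation memory stays roughly constant in sequence length, which is consistent with the inference-time memory savings the summary reports.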