Multimodal Speech Separation with Feedback Architecture


Full details

Bibliographic details
Main author: Qu, Yanren
Other authors: Faculty of Information Technology, University of Jyväskylä
Material type: Master's thesis (pro gradu)
Language: English
Published: 2025
Subjects:
Links: https://jyx.jyu.fi/handle/123456789/102963
Description
Summary: This thesis tackles the high computational cost and output inconsistency of processing long audio sequences in multimodal speech separation. We introduce the Self-Feedback RE-Sepformer (SFRS), an architecture that integrates an RNN-inspired incremental inference mechanism with an RE-Sepformer backbone. SFRS processes non-overlapping audio chunks, using propagated hidden states and previous masks as feedback, guided by strategically fused textual embeddings. Experiments on TextrolMix show that SFRS achieves 10.8 dB SI-SDRi, outperforming an LLM-TSE baseline. SFRS also yields substantial GPU memory savings, especially during inference (~7.2 GB vs. ~27.5 GB for a comparable non-incremental approach) and in training (~20.2 GB vs. ~33.6 GB). While full convergence was limited by computational constraints, SFRS effectively mitigates the burden of long-sequence processing, reduces memory usage, and improves output consistency, offering a promising path toward efficient, context-aware multimodal speech separation.
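
The summary describes the core mechanism only at a high level: chunk-wise processing in which a propagated hidden state and the previous chunk's estimated mask are fed back as conditioning, together with a fused textual embedding. The sketch below illustrates that feedback loop in PyTorch. It is a minimal, hypothetical reconstruction, not the thesis implementation: the names (`FeedbackSeparator`, `separate`), the dimensions, and the GRU standing in for the RE-Sepformer backbone are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class FeedbackSeparator(nn.Module):
    """Toy chunk-wise separator with RNN-style self-feedback (illustrative only).

    A GRU carries the hidden state across chunks; the previous chunk's mask
    and a text embedding condition the current chunk, loosely mirroring the
    feedback mechanism the abstract describes.
    """

    def __init__(self, feat_dim=128, text_dim=64, hidden_dim=128):
        super().__init__()
        # Input = mixture features + previous mask + fused text embedding.
        self.proj = nn.Linear(feat_dim + feat_dim + text_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden_dim, feat_dim), nn.Sigmoid())

    def forward(self, chunk, prev_mask, text_emb, hidden):
        # chunk, prev_mask: (B, T, feat_dim); text_emb: (B, text_dim)
        text = text_emb.unsqueeze(1).expand(-1, chunk.size(1), -1)
        x = self.proj(torch.cat([chunk, prev_mask, text], dim=-1))
        out, hidden = self.rnn(x, hidden)
        return self.mask_head(out), hidden


def separate(model, mixture, text_emb, chunk_len=100):
    """Incremental inference over non-overlapping chunks."""
    B, T, F = mixture.shape
    hidden = None
    prev_mask = torch.zeros(B, chunk_len, F)  # neutral feedback for the first chunk
    masks = []
    for start in range(0, T, chunk_len):
        chunk = mixture[:, start:start + chunk_len]
        if chunk.size(1) < chunk_len:  # zero-pad the final short chunk
            chunk = nn.functional.pad(chunk, (0, 0, 0, chunk_len - chunk.size(1)))
        mask, hidden = model(chunk, prev_mask, text_emb, hidden)
        prev_mask = mask.detach()  # feed the mask forward, not its gradient
        masks.append(mask)
    return torch.cat(masks, dim=1)[:, :T] * mixture


if __name__ == "__main__":
    model = FeedbackSeparator()
    mix = torch.randn(2, 350, 128)  # (batch, frames, features)
    txt = torch.randn(2, 64)        # fused textual embedding
    print(separate(model, mix, txt).shape)  # torch.Size([2, 350, 128])
```

Because each iteration touches only one fixed-length chunk while context flows through `hidden` and `prev_mask`, peak activation memory stays roughly constant in sequence length, which is consistent with the inference-time memory savings the summary reports.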