Multimodal Speech Separation with Feedback Architecture


Bibliographic Details
Main Author: Qu, Yanren
Other Authors: Informaatioteknologian tiedekunta, Faculty of Information Technology, Jyväskylän yliopisto, University of Jyväskylä
Format: Master's thesis
Language: eng
Published: 2025
Subjects: Master's Degree Programme in Artificial Intelligence
Online Access: https://jyx.jyu.fi/handle/123456789/102963
Description: This thesis tackles the high computational cost and output inconsistency of processing long audio sequences in multimodal speech separation. We introduce the Self-Feedback RE-Sepformer (SFRS), an architecture that integrates an RNN-inspired incremental inference mechanism with an RE-Sepformer backbone. SFRS processes non-overlapping audio chunks, using propagated hidden states and previous masks as feedback, guided by strategically fused textual embeddings. Experiments on TextrolMix show that SFRS achieves 10.8 dB SI-SDRi, outperforming an LLM-TSE baseline. SFRS also demonstrates substantial GPU memory savings, both during inference (~7.2 GB vs. ~27.5 GB for a comparable non-incremental approach) and during training (~20.2 GB vs. ~33.6 GB). While full convergence was limited by computational constraints, SFRS effectively mitigates long-sequence processing burdens, reduces memory usage, and improves consistency, offering a promising path toward efficient, context-aware multimodal speech separation.
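To make the abstract's two key ideas concrete, the sketch below illustrates (a) the SI-SDR metric underlying the reported SI-SDRi figure, and (b) the general shape of RNN-style incremental inference over non-overlapping chunks with hidden-state and mask feedback. This is a minimal illustration only, not the thesis's implementation: `step_fn` is a hypothetical stand-in for one SFRS forward pass, and the function names are invented for this example.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, target: np.ndarray) -> float:
    """Scale-invariant SDR in dB. SI-SDRi (the improvement) is this
    value minus the SI-SDR of the raw mixture against the same target."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the target; the residual is treated as noise.
    s_target = (estimate @ target) / (target @ target) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise))

def incremental_separate(mixture, chunk_len, step_fn, text_embedding):
    """Process non-overlapping chunks sequentially, feeding the previous
    hidden state and mask back into the next step (RNN-style inference),
    so memory stays bounded by the chunk size rather than sequence length.

    step_fn(chunk, hidden, prev_mask, text_embedding) is a hypothetical
    single forward step returning (separated_chunk, new_hidden, new_mask)."""
    hidden, prev_mask = None, None
    outputs = []
    for start in range(0, len(mixture), chunk_len):
        chunk = mixture[start:start + chunk_len]
        out, hidden, prev_mask = step_fn(chunk, hidden, prev_mask, text_embedding)
        outputs.append(out)
    return np.concatenate(outputs)
```

The loop shows why peak memory no longer scales with the full sequence: each step sees only one chunk plus the carried-over state, which is the intuition behind the inference-time savings the abstract reports.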
Advisor: Khriyenko, Oleksiy
Pages: 61
Rights: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/), © The Author(s)
Access level: Open access
URN: URN:NBN:fi:jyu-202506024772
Degree programme: Master's Degree Programme in Artificial Intelligence
Accessibility: no accessibility information provided
Full text (PDF): https://jyx.jyu.fi/bitstreams/e41ed924-209b-4063-ad24-36497cd5529c/download
Permalink: http://www.urn.fi/URN:NBN:fi:jyu-202506024772