Multimodal Speech Separation with Feedback Architecture


Bibliographic Details
Main Author: Qu, Yanren
Other Authors: Informaatioteknologian tiedekunta, Faculty of Information Technology, Jyväskylän yliopisto, University of Jyväskylä
Format: Master's thesis
Language: eng
Published: 2025
Subjects: Master's Degree Programme in Artificial Intelligence
Online Access: https://jyx.jyu.fi/handle/123456789/102963
Description: This thesis tackles the high computational cost and output inconsistency of processing long audio sequences in multimodal speech separation. We introduce the Self-Feedback RE-Sepformer (SFRS), an architecture that integrates an RNN-inspired incremental inference mechanism with an RE-Sepformer backbone. SFRS processes non-overlapping audio chunks, using propagated hidden states and previous masks as feedback, guided by strategically fused textual embeddings. Experiments on TextrolMix show that SFRS achieves 10.8 dB SI-SDRi, outperforming an LLM-TSE baseline. SFRS also demonstrates substantial GPU memory savings, both during inference (~7.2 GB vs. ~27.5 GB for a comparable non-incremental approach) and during training (~20.2 GB vs. ~33.6 GB). While full convergence was limited by computational constraints, SFRS effectively mitigates long-sequence processing burdens, reduces memory usage, and improves consistency, offering a promising path toward efficient, context-aware multimodal speech separation.
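To make the abstract's two key ideas concrete, the sketch below illustrates (a) the SI-SDR metric underlying the reported SI-SDRi figure, and (b) the general shape of RNN-style incremental inference over non-overlapping chunks with hidden-state and mask feedback. This is a minimal illustration only, not the thesis's implementation: `step_fn` is a hypothetical stand-in for one SFRS forward pass, and the function names are invented for this example.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, target: np.ndarray) -> float:
    """Scale-invariant SDR in dB. SI-SDRi (the improvement) is this
    value minus the SI-SDR of the raw mixture against the same target."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the target; the residual is treated as noise.
    s_target = (estimate @ target) / (target @ target) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise))

def incremental_separate(mixture, chunk_len, step_fn, text_embedding):
    """Process non-overlapping chunks sequentially, feeding the previous
    hidden state and mask back into the next step (RNN-style inference),
    so memory stays bounded by the chunk size rather than sequence length.

    step_fn(chunk, hidden, prev_mask, text_embedding) is a hypothetical
    single forward step returning (separated_chunk, new_hidden, new_mask)."""
    hidden, prev_mask = None, None
    outputs = []
    for start in range(0, len(mixture), chunk_len):
        chunk = mixture[start:start + chunk_len]
        out, hidden, prev_mask = step_fn(chunk, hidden, prev_mask, text_embedding)
        outputs.append(out)
    return np.concatenate(outputs)
```

The loop shows why peak memory no longer scales with the full sequence: each step sees only one chunk plus the carried-over state, which is the intuition behind the inference-time savings the abstract reports.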
Advisor: Khriyenko, Oleksiy
Pages: 61
Rights: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/), © The Author(s)
Access level: Open access
URN: URN:NBN:fi:jyu-202506024772
Degree programme: Master's Degree Programme in Artificial Intelligence
Accessibility: no accessibility information provided
Full text (PDF): https://jyx.jyu.fi/bitstreams/e41ed924-209b-4063-ad24-36497cd5529c/download
Permalink: http://www.urn.fi/URN:NBN:fi:jyu-202506024772