Large Language Model Driven Context-Aware Robotic Manipulation via Human Demonstration

This thesis introduces a zero-shot adaptive imitation learning framework that leverages the power of foundation models to enable robots to perform context-aware manipulation tasks solely from passive video demonstrations. Loosely inspired by human cognitive adaptability, the architecture extracts high-level knowledge from raw video demonstrations and generates abstract symbolic knowledge representations without any robot-specific data. It also demonstrates the versatility of foundation models in bridging perception, reasoning, and action in modern autonomous systems.

The core contribution of this thesis is the Object Action Graph (OAG), a structured semantic representation that captures the high-level action sequence together with the objects present in the video demonstration. This abstract knowledge is generated by a foundation vision-language model in a zero-shot manner. Integrating audio transcription improves the OAG knowledge representation by 60% in the secondary task-recognition phase. To address challenges in single-shot imitation learning, such as domain shift between the demonstration and the robotic execution environment, the thesis proposes the Semantic Object Action Graph (SOAG). It enables the transfer of abstract task knowledge across semantically related but visually dissimilar objects, allowing robots to generalize actions to novel contexts without predefined motion primitives or large numbers of demonstrations.

The proposed approach was evaluated through systematically designed experiments on three contact-rich tasks: pushing, pulling, and reaching distant objects with the help of another object. Overall, the system achieved 75% accuracy across 12 execution phases. The evaluations demonstrated the system's ability to generalize tasks, plan trajectories, and select appropriate tools based solely on contextual understanding. Despite the promising results, the study also highlights key challenges and limitations of the system, pointing to new research directions in this domain.
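The thesis itself is under restricted access, so only the abstract is available here; still, a rough sketch may help make its two central representations concrete. The following Python snippet is a hypothetical illustration, not the author's implementation: every class, field, and function name (ObjectNode, ActionEdge, ObjectActionGraph, semantic_match) is an assumption inferred from the abstract's description of the OAG as a graph of objects and high-level actions, and of the SOAG as a means of transferring actions across semantically related but visually dissimilar objects.

```python
# Hypothetical sketch of an Object Action Graph (OAG) and SOAG-style object
# substitution, reconstructed from the abstract alone. All names and the
# similarity measure are assumptions, not the thesis's actual design.
from dataclasses import dataclass, field
import difflib


@dataclass
class ObjectNode:
    """An object detected in the video demonstration."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class ActionEdge:
    """A high-level action linking two objects, ordered by demonstration step."""
    action: str          # e.g. "push", "pull", "reach-with"
    subject: ObjectNode  # the acting or tool object
    target: ObjectNode   # the object acted upon
    step: int            # position in the demonstrated sequence


@dataclass
class ObjectActionGraph:
    """Abstract symbolic task knowledge extracted zero-shot from a video."""
    objects: list
    actions: list

    def action_sequence(self):
        """Return the ordered high-level plan implied by the graph."""
        return [f"{e.action}({e.subject.name} -> {e.target.name})"
                for e in sorted(self.actions, key=lambda e: e.step)]


def semantic_match(demo_object, scene_objects):
    """SOAG-style substitution: map a demonstrated object onto the most
    semantically similar object in the robot's scene. The thesis presumably
    scores similarity with a foundation model; plain string similarity
    stands in here only so the sketch runs self-contained."""
    return max(scene_objects,
               key=lambda o: difflib.SequenceMatcher(None, demo_object, o).ratio())


# Toy usage: "reach a distant box with the help of another object".
stick, box = ObjectNode("wooden stick"), ObjectNode("box")
oag = ObjectActionGraph(
    objects=[stick, box],
    actions=[ActionEdge("reach-with", stick, box, 0),
             ActionEdge("pull", stick, box, 1)],
)
print(oag.action_sequence())
# ['reach-with(wooden stick -> box)', 'pull(wooden stick -> box)']

# If the execution scene has no stick, matching picks a semantic substitute.
print(semantic_match("wooden stick", ["plastic rod", "wooden spoon", "cup"]))
# 'wooden spoon' (closest string; a real system would use semantic embeddings)
```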

Full details

Bibliographic information
Main author: Hossain, Iftekher
Advisor: Terziyan, Vagan
Other contributors: Faculty of Information Technology, University of Jyväskylä
Material type: Master's thesis (pro gradu)
Language: English
Published: 2025
Extent: 89 pages
Subjects: Master's Degree Programme in Artificial Intelligence
Rights: In Copyright, © The Author(s). Restricted access: the author has not given permission to make the work publicly available electronically, so the material can be read only at the archival workstation at Jyväskylä University Library.
Links: https://jyx.jyu.fi/handle/123456789/102764
URN: URN:NBN:fi:jyu-202505264594