Large Language Model Driven Context-Aware Robotic Manipulation via Human Demonstration

This thesis introduces a zero-shot adaptive imitation learning framework that leverages the power of foundation models to enable robots to perform context-aware manipulation tasks solely from passive video demonstrations. Loosely inspired by human cognitive adaptability, the architecture extracts high-level knowledge from raw video demonstrations and generates abstract symbolic knowledge representations without any robot-specific data. It also demonstrates the versatility of foundation models in bridging perception, reasoning, and action in modern autonomous systems.

The core contribution of this thesis is the Object Action Graph (OAG), a structured semantic representation that captures the high-level action sequence together with the objects present in the video demonstration. This abstract knowledge is generated by a foundation vision-language model in a zero-shot manner. Integrating audio transcription improves the OAG knowledge representation by 60% in the secondary task-recognition phase. To address challenges in single-shot imitation learning, such as domain shift between the demonstration and the robotic execution environment, the thesis proposes the Semantic Object Action Graph (SOAG). It enables the transfer of abstract task knowledge across semantically related but visually dissimilar objects, allowing robots to generalize actions to novel contexts without predefined motion primitives or large numbers of demonstrations.

The proposed approach was evaluated through systematically designed experiments on three contact-rich tasks: pushing, pulling, and reaching distant objects with the help of another object. Overall, the system achieved 75% accuracy across 12 execution phases. The evaluations demonstrated the system's ability to generalize tasks, plan trajectories, and select appropriate tools based solely on contextual understanding. Despite the promising results, the study also highlights key challenges and limitations of the system, pointing to new research directions in this domain.
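The thesis itself is under restricted access, so only the abstract is available here; still, a rough sketch may help make its two central representations concrete. The following Python snippet is a hypothetical illustration, not the author's implementation: every class, field, and function name (ObjectNode, ActionEdge, ObjectActionGraph, semantic_match) is an assumption inferred from the abstract's description of the OAG as a graph of objects and high-level actions, and of the SOAG as a means of transferring actions across semantically related but visually dissimilar objects.

```python
# Hypothetical sketch of an Object Action Graph (OAG) and SOAG-style object
# substitution, reconstructed from the abstract alone. All names and the
# similarity measure are assumptions, not the thesis's actual design.
from dataclasses import dataclass, field
import difflib


@dataclass
class ObjectNode:
    """An object detected in the video demonstration."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class ActionEdge:
    """A high-level action linking two objects, ordered by demonstration step."""
    action: str          # e.g. "push", "pull", "reach-with"
    subject: ObjectNode  # the acting or tool object
    target: ObjectNode   # the object acted upon
    step: int            # position in the demonstrated sequence


@dataclass
class ObjectActionGraph:
    """Abstract symbolic task knowledge extracted zero-shot from a video."""
    objects: list
    actions: list

    def action_sequence(self):
        """Return the ordered high-level plan implied by the graph."""
        return [f"{e.action}({e.subject.name} -> {e.target.name})"
                for e in sorted(self.actions, key=lambda e: e.step)]


def semantic_match(demo_object, scene_objects):
    """SOAG-style substitution: map a demonstrated object onto the most
    semantically similar object in the robot's scene. The thesis presumably
    scores similarity with a foundation model; plain string similarity
    stands in here only so the sketch runs self-contained."""
    return max(scene_objects,
               key=lambda o: difflib.SequenceMatcher(None, demo_object, o).ratio())


# Toy usage: "reach a distant box with the help of another object".
stick, box = ObjectNode("wooden stick"), ObjectNode("box")
oag = ObjectActionGraph(
    objects=[stick, box],
    actions=[ActionEdge("reach-with", stick, box, 0),
             ActionEdge("pull", stick, box, 1)],
)
print(oag.action_sequence())
# ['reach-with(wooden stick -> box)', 'pull(wooden stick -> box)']

# If the execution scene has no stick, matching picks a semantic substitute.
print(semantic_match("wooden stick", ["plastic rod", "wooden spoon", "cup"]))
# 'wooden spoon' (closest string; a real system would use semantic embeddings)
```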

Full details

Bibliographic information
Main author: Hossain, Iftekher
Advisor: Terziyan, Vagan
Other contributors: Faculty of Information Technology, University of Jyväskylä
Material type: Master's thesis (pro gradu)
Language: English
Published: 2025
Extent: 89 pages
Subjects: Master's Degree Programme in Artificial Intelligence
Rights: In Copyright, © The Author(s). Restricted access: the author has not given permission to make the work publicly available electronically, so the material can be read only at the archival workstation at Jyväskylä University Library.
Links: https://jyx.jyu.fi/handle/123456789/102764
URN: URN:NBN:fi:jyu-202505264594