Challenges and insights in semantic search using language models

Information Retrieval systems such as search engines, originally designed to assist users in finding information, have evolved to become more potent and have found utility in a wider range of applications by incorporating contextual comprehension using Language Models. Selecting the proper Language Mo...

Full description

Bibliographic Details
Main Author: Hajihashemi Varnousfaderani, Elahe
Other Authors: Faculty of Information Technology, Informaatioteknologian tiedekunta, Information Technology, Informaatioteknologia, University of Jyväskylä, Jyväskylän yliopisto
Format: Master's thesis
Language:eng
Published: 2023
Subjects:
Online Access: https://jyx.jyu.fi/handle/123456789/92552
_version_ 1826225751072964608
author Hajihashemi Varnousfaderani, Elahe
author2 Faculty of Information Technology Informaatioteknologian tiedekunta Information Technology Informaatioteknologia University of Jyväskylä Jyväskylän yliopisto
author_facet Hajihashemi Varnousfaderani, Elahe Faculty of Information Technology Informaatioteknologian tiedekunta Information Technology Informaatioteknologia University of Jyväskylä Jyväskylän yliopisto Hajihashemi Varnousfaderani, Elahe Faculty of Information Technology Informaatioteknologian tiedekunta Information Technology Informaatioteknologia University of Jyväskylä Jyväskylän yliopisto
author_sort Hajihashemi Varnousfaderani, Elahe
datasource_str_mv jyx
description Information Retrieval systems such as search engines, originally designed to assist users in finding information, have evolved to become more potent and have found utility in a wider range of applications by incorporating contextual comprehension using Language Models. Selecting the proper Language Model for the desired task is a challenging multi-objective problem, as each model has a specific set of attributes that affect its performance. Accuracy, resource consumption, and time consumption are the most important objectives considered in assessing the quality of a search system. This research addresses these objectives by exploring the performance of two Language Models with differing characteristics in developing a semantic search pipeline. The studied Language Models are a distilled version of the BERT model fine-tuned on a specific task, and GPT-2, a general pre-trained model with a huge number of parameters. The semantic search pipeline, which consists of mapping contents and queries into a common vector space using a Large Language Model and finding the most relevant results, is implemented in this study as the experimental setup of the research. Utilizing evaluation metrics to assess a model's performance necessitates the availability of ground truth data. Therefore, this research proposes various approaches for generating synthetic ground truth to tackle evaluation and fine-tuning challenges when labeled data is scarce. To pursue the research objectives, quantitative data are gathered in an experimental setting, and conclusions are drawn and recommendations made by analyzing the results of the experiments. The experimental results indicate that the size of the model should not be the major criterion in selecting a language model for downstream tasks. The model architecture and fine-tuning on a specialized dataset dramatically affect the performance as well.
As shown by the results, the smaller model fine-tuned for semantic textual similarity surpasses the larger general model. The experiment investigating the proposed approaches for generating annotations indicates that those methods are readily applicable to computing evaluation metrics and can be extended to fine-tuning. The results demonstrate that task-oriented transfer learning through distillation and fine-tuning can compensate for the learning capacity instilled in general models by a larger number of parameters, though this should be investigated further in future research with respect to the values assigned to various variables in this study, e.g., the number of tokens used when splitting large texts into smaller chunks. Moreover, it would be worthwhile to fine-tune the general large model as well in the future, so that the models can be compared under more comparable conditions.
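The pipeline sketched in the description (encode contents and queries into a common vector space, rank by similarity, then evaluate against ground-truth labels) can be illustrated as follows. This is a minimal sketch, not the thesis's implementation: `embed` here is a toy hashed bag-of-words stand-in for a real Language Model encoder (such as a distilled BERT), and `recall_at_k` is one generic example of an evaluation metric; the function names are illustrative, not taken from the thesis.

```python
import math

def embed(text, dim=64):
    # Toy stand-in for a Language Model encoder: a hashed bag-of-words
    # vector. A real pipeline would replace this with model embeddings
    # (e.g., from a distilled BERT fine-tuned for semantic similarity).
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def cosine(a, b):
    # Cosine similarity between two vectors in the common vector space.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, documents, top_k=3):
    # Map the query into the same space as the documents and return
    # the top_k most similar documents.
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]

def recall_at_k(ground_truth, results, k):
    # Fraction of queries whose relevant document appears in the top-k
    # results. ground_truth maps query -> relevant document; results
    # maps query -> ranked document list.
    hits = sum(1 for q, rel in ground_truth.items() if rel in results[q][:k])
    return hits / len(ground_truth)
```

With synthetically generated ground truth of the kind the description mentions, a metric such as `recall_at_k` can be computed even when human-labeled data is scarce, which is the evaluation strategy the abstract alludes to.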
first_indexed 2024-01-08T21:01:07Z
format Pro gradu
free_online_boolean 1
fullrecord [{"key": "dc.contributor.advisor", "value": "Khriyenko, Oleksiy", "language": null, "element": "contributor", "qualifier": "advisor", "schema": "dc"}, {"key": "dc.contributor.author", "value": "Hajihashemi Varnousfaderani, Elahe", "language": "", "element": "contributor", "qualifier": "author", "schema": "dc"}, {"key": "dc.date.accessioned", "value": "2024-01-08T07:20:46Z", "language": null, "element": "date", "qualifier": "accessioned", "schema": "dc"}, {"key": "dc.date.available", "value": "2024-01-08T07:20:46Z", "language": null, "element": "date", "qualifier": "available", "schema": "dc"}, {"key": "dc.date.issued", "value": "2023", "language": null, "element": "date", "qualifier": "issued", "schema": "dc"}, {"key": "dc.identifier.uri", "value": "https://jyx.jyu.fi/handle/123456789/92552", "language": null, "element": "identifier", "qualifier": "uri", "schema": "dc"}, {"key": "dc.description.abstract", "value": "Information Retrieval systems such as search engines, originally designed to assist users in finding information, have evolved to become more potent and have found utility in wider range of applications by incorporating contextual comprehension using Language Models. Selecting the proper Language Model corresponding to the desired task is a challenging multi-objectives problem as each model has specific set of attributes which affect the performance. Accuracy, resource and time consumption are the most important objectives considered in assessing the quality of a search system. These objectives are addressed in this research by exploring the performance of two Language Models with variant characteristics in developing a semantic search pipeline. 
The studied Language Models include a distilled version of BERT model fine-tuned on specific task and GPT-2 as a general pre-trained model with huge number of parameters.\nThe semantic search pipeline consisting of mapping the contents and queries into a common vector space using Large Language Model and finding the most relevant results is implemented in this study as experimental set up of the qualitative research. Utilizing evaluation metrics to assess the model\u2019s performance necessitates the availability of ground truth data. Therefore, current research brings up various approaches aimed at generating synthetic ground truth to tackle evaluation and fine-tuning challenges when labeled data is scarce. To follow the research objectives, quantitative data is gathered through an experimental setting and conclusions are drawn and recommendations are raised by analyzing the results of the experiments. \nThe experimental results indicate the size of the model should not be the major criterion in selecting the language model for downstream tasks. The model architecture and being fine-tuned on special dataset will dramatically affect the performance as well. As it is shown by results, the smaller fine-tuned model for semantic textual similarity surpasses the larger general model. The experiment on investigating the proposed approaches for generating annotations signifies that those methods are decently applicable in computing evaluation metrics and can be extended to fine-tuning. \nThe results demonstrate that the task-oriented transferred learning by distillation and fine-tuning can compromise the learning capacity instilled in general models by a larger number of parameters, but it should be investigated in future research regarding the values set to various variables in this research e.g., the number of tokens considered in splitting the large text into smaller chunks. 
Moreover, it would be worthful to fine-tune the general large model as well in the future to compare them in a more comparable condition.", "language": "en", "element": "description", "qualifier": "abstract", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Submitted by Paivi Vuorio (paelvuor@jyu.fi) on 2024-01-08T07:20:46Z\nNo. of bitstreams: 0", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Made available in DSpace on 2024-01-08T07:20:46Z (GMT). No. of bitstreams: 0\n Previous issue date: 2023", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.format.extent", "value": "157", "language": "", "element": "format", "qualifier": "extent", "schema": "dc"}, {"key": "dc.language.iso", "value": "eng", "language": null, "element": "language", "qualifier": "iso", "schema": "dc"}, {"key": "dc.rights", "value": "In Copyright", "language": "en", "element": "rights", "qualifier": null, "schema": "dc"}, {"key": "dc.subject.other", "value": "semantic search", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "large language models", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "generative models", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "fine-tuning", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "transfer learning", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.title", "value": "Challenges and insights in semantic search using language models", "language": "", "element": "title", "qualifier": null, "schema": "dc"}, {"key": "dc.type", "value": "master thesis", "language": null, "element": "type", "qualifier": null, 
"schema": "dc"}, {"key": "dc.identifier.urn", "value": "URN:NBN:fi:jyu-202401081055", "language": null, "element": "identifier", "qualifier": "urn", "schema": "dc"}, {"key": "dc.type.ontasot", "value": "Master\u2019s thesis", "language": "en", "element": "type", "qualifier": "ontasot", "schema": "dc"}, {"key": "dc.type.ontasot", "value": "Pro gradu -tutkielma", "language": "fi", "element": "type", "qualifier": "ontasot", "schema": "dc"}, {"key": "dc.contributor.faculty", "value": "Faculty of Information Technology", "language": "en", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": "dc.contributor.faculty", "value": "Informaatioteknologian tiedekunta", "language": "fi", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": "dc.contributor.department", "value": "Information Technology", "language": "en", "element": "contributor", "qualifier": "department", "schema": "dc"}, {"key": "dc.contributor.department", "value": "Informaatioteknologia", "language": "fi", "element": "contributor", "qualifier": "department", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "University of Jyv\u00e4skyl\u00e4", "language": "en", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "Jyv\u00e4skyl\u00e4n yliopisto", "language": "fi", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Mathematical Information Technology", "language": "en", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Tietotekniikka", "language": "fi", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "yvv.contractresearch.funding", "value": "0", "language": "", "element": "contractresearch", "qualifier": "funding", "schema": "yvv"}, {"key": "dc.type.coar", "value": "http://purl.org/coar/resource_type/c_bdcc", "language": null, "element": 
"type", "qualifier": "coar", "schema": "dc"}, {"key": "dc.rights.copyright", "value": "\u00a9 The Author(s)", "language": null, "element": "rights", "qualifier": "copyright", "schema": "dc"}, {"key": "dc.rights.accesslevel", "value": "openAccess", "language": null, "element": "rights", "qualifier": "accesslevel", "schema": "dc"}, {"key": "dc.type.publication", "value": "masterThesis", "language": null, "element": "type", "qualifier": "publication", "schema": "dc"}, {"key": "dc.subject.oppiainekoodi", "value": "602", "language": null, "element": "subject", "qualifier": "oppiainekoodi", "schema": "dc"}, {"key": "dc.subject.yso", "value": "luonnollisen kielen k\u00e4sittely", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "tiedonhaku", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "mallintaminen", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "teko\u00e4ly", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "koneoppiminen", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "natural language processing", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "information retrieval", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "modelling (representation)", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "artificial intelligence", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "machine learning", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.rights.url", "value": 
"https://rightsstatements.org/page/InC/1.0/", "language": null, "element": "rights", "qualifier": "url", "schema": "dc"}]
id jyx.123456789_92552
language eng
last_indexed 2025-02-18T10:55:51Z
main_date 2023-01-01T00:00:00Z
main_date_str 2023
online_boolean 1
online_urls_str_mv {"url":"https:\/\/jyx.jyu.fi\/bitstreams\/8a86fa7f-e3c2-4177-8c7c-14575c388c81\/download","text":"URN:NBN:fi:jyu-202401081055.pdf","source":"jyx","mediaType":"application\/pdf"}
publishDate 2023
record_format qdc
source_str_mv jyx
spellingShingle Hajihashemi Varnousfaderani, Elahe Challenges and insights in semantic search using language models semantic search large language models generative models fine-tuning transfer learning Mathematical Information Technology Tietotekniikka 602 luonnollisen kielen käsittely tiedonhaku mallintaminen tekoäly koneoppiminen natural language processing information retrieval modelling (representation) artificial intelligence machine learning
title Challenges and insights in semantic search using language models
title_full Challenges and insights in semantic search using language models
title_fullStr Challenges and insights in semantic search using language models Challenges and insights in semantic search using language models
title_full_unstemmed Challenges and insights in semantic search using language models Challenges and insights in semantic search using language models
title_short Challenges and insights in semantic search using language models
title_sort challenges and insights in semantic search using language models
title_txtP Challenges and insights in semantic search using language models
topic semantic search large language models generative models fine-tuning transfer learning Mathematical Information Technology Tietotekniikka 602 luonnollisen kielen käsittely tiedonhaku mallintaminen tekoäly koneoppiminen natural language processing information retrieval modelling (representation) artificial intelligence machine learning
topic_facet 602 Mathematical Information Technology Tietotekniikka artificial intelligence fine-tuning generative models information retrieval koneoppiminen large language models luonnollisen kielen käsittely machine learning mallintaminen modelling (representation) natural language processing semantic search tekoäly tiedonhaku transfer learning
url https://jyx.jyu.fi/handle/123456789/92552 http://www.urn.fi/URN:NBN:fi:jyu-202401081055
work_keys_str_mv AT hajihashemivarnousfaderanielahe challengesandinsightsinsemanticsearchusinglanguagemodels