Improving search engine results using different machine learning models and tools

The aim of this thesis is to provide viable methods that can be used to improve the return position (RP) of a relevant document when a natural language query (NLQ) is applied by a user. For the purpose of demonstration, we will be using IBM's Watson Discovery Service (WDS) as a search engine t...

Täydet tiedot

Bibliografiset tiedot
Päätekijä:	Ambaye, Michael
Muut tekijät:	Informaatioteknologian tiedekunta, Faculty of Information Technology, Informaatioteknologia, Information Technology, Jyväskylän yliopisto, University of Jyväskylä
Aineistotyyppi:	Pro gradu
Kieli:	eng
Julkaistu:	2020
Aiheet:	IBM Watson natural language query Watson discovery Tietotekniikka Mathematical Information Technology 602 tiedonhaku hakuohjelmat koneoppiminen big data kyselykielet tiedonhallinta luonnollinen kieli kieli ja kielet Query tiedonhakujärjestelmät information retrieval search engines machine learning query languages information management natural language languages information retrieval systems
Linkit:	https://jyx.jyu.fi/handle/123456789/73119

_version_	1833407657064529920
author	Ambaye, Michael
author2	Informaatioteknologian tiedekunta Faculty of Information Technology Informaatioteknologia Information Technology Jyväskylän yliopisto University of Jyväskylä
author_facet	Ambaye, Michael Informaatioteknologian tiedekunta Faculty of Information Technology Informaatioteknologia Information Technology Jyväskylän yliopisto University of Jyväskylä Ambaye, Michael Informaatioteknologian tiedekunta Faculty of Information Technology Informaatioteknologia Information Technology Jyväskylän yliopisto University of Jyväskylä
author_sort	Ambaye, Michael
datasource_str_mv	jyx
description	The aim of this thesis is to provide viable methods that can be used to improve the return position (RP) of a relevant document when a natural language query (NLQ) is applied by a user. For the purpose of demonstration, we will be using IBM's Watson Discovery Service (WDS) as a search engine that uses supervised machine learning. This feature of WDS enables a user to train the tool so that it can learn to associate the language used in the NLQ to the language used in documents labelled as relevant. Therefore, instead of mapping an NLQ to the relevant document, it will build a model that works in such a way that similar language used in the natural language query will be associated with documents containing similar language as the document labeled as relevant. The search engine works in such a way that it first searches for the first 100 documents and then ranks the documents based on the training examples provided by the user. In other words, the training example is only applied after the search is complete and the first 100 documents are collected. The first 100 documents are retrieved based on what has been enabled from options such as: keywords, entities, relations, semantic roles, concept, category classification, sentiment analysis, emotion analysis, and element classification (Watson Discovery Service, 2019). Bringing 100 documents to be re-ranked for NLQ presents a challenge when the user uses a language that is not present in the documents ingested. For example, the documents ingested could be technical documents using official languages and the user could be using a search word that is commonly used among colleagues. This would mean that even when the training example is present for the type of language used by the user pointing to relevant document, the user will not be able to get the expected documents because they will not have been inside the first 100 documents and therefore will not be re-ranked. Therefore, in this thesis, we will be going through various tools and methods that would enable us to improve the return position of relevant documents that a user expects.
first_indexed	2020-12-11T21:01:32Z
format	Pro gradu
free_online_boolean	1
fullrecord	[{"key": "dc.contributor.advisor", "value": "Khriyenko, Oleksiy", "language": "", "element": "contributor", "qualifier": "advisor", "schema": "dc"}, {"key": "dc.contributor.author", "value": "Ambaye, Michael", "language": "", "element": "contributor", "qualifier": "author", "schema": "dc"}, {"key": "dc.date.accessioned", "value": "2020-12-11T07:49:35Z", "language": null, "element": "date", "qualifier": "accessioned", "schema": "dc"}, {"key": "dc.date.available", "value": "2020-12-11T07:49:35Z", "language": null, "element": "date", "qualifier": "available", "schema": "dc"}, {"key": "dc.date.issued", "value": "2020", "language": "", "element": "date", "qualifier": "issued", "schema": "dc"}, {"key": "dc.identifier.uri", "value": "https://jyx.jyu.fi/handle/123456789/73119", "language": null, "element": "identifier", "qualifier": "uri", "schema": "dc"}, {"key": "dc.description.abstract", "value": "The aim of this thesis is to provide viable methods that can be used to improve the return position (RP) of a relevant document when a natural language query (NLQ) is applied by a user. \nFor the purpose of demonstration, we will be using IBM's Watson Discovery Service (WDS) as a search engine that uses supervised machine learning. This feature of WDS enables a user to train the tool so that it can learn to associate the language used in the NLQ to the language used in documents labelled as relevant. Therefore, instead of mapping an NLQ to the relevant document, it will build a model that works in such a way that similar language used in the natural language query will be associated with documents containing similar language as the document labeled as relevant. \nThe search engine works in such a way that it first searches for the first 100 documents and then ranks the documents based on the training examples provided by the user. In other words, the training example is only applied after the search is complete and the first 100 documents are collected. \nThe first 100 documents are retrieved based on what has been enabled from options such as: keywords, entities, relations, semantic roles, concept, category classification, sentiment analysis, emotion analysis, and element classification (Watson Discovery Service, 2019). \nBringing 100 documents to be re-ranked for NLQ presents a challenge when the user uses a language that is not present in the documents ingested. For example, the documents ingested could be technical documents using official languages and the user could be using a search word that is commonly used among colleagues. This would mean that even when the training example is present for the type of language used by the user pointing to relevant document, the user will not be able to get the expected documents because they will not have been inside the first 100 documents and therefore will not be re-ranked. Therefore, in this thesis, we will be going through various tools and methods that would enable us to improve the return position of relevant documents that a user expects.", "language": "en", "element": "description", "qualifier": "abstract", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Submitted by Paivi Vuorio (paelvuor@jyu.fi) on 2020-12-11T07:49:35Z\nNo. of bitstreams: 0", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Made available in DSpace on 2020-12-11T07:49:35Z (GMT). No. of bitstreams: 0\n Previous issue date: 2020", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.format.extent", "value": "67", "language": "", "element": "format", "qualifier": "extent", "schema": "dc"}, {"key": "dc.format.mimetype", "value": "application/pdf", "language": null, "element": "format", "qualifier": "mimetype", "schema": "dc"}, {"key": "dc.language.iso", "value": "eng", "language": null, "element": "language", "qualifier": "iso", "schema": "dc"}, {"key": "dc.rights", "value": "In Copyright", "language": "en", "element": "rights", "qualifier": null, "schema": "dc"}, {"key": "dc.subject.other", "value": "IBM Watson", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "natural language query", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "Watson discovery", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.title", "value": "Improving search engine results using different machine learning models and tools", "language": "", "element": "title", "qualifier": null, "schema": "dc"}, {"key": "dc.type", "value": "master thesis", "language": null, "element": "type", "qualifier": null, "schema": "dc"}, {"key": "dc.identifier.urn", "value": "URN:NBN:fi:jyu-202012117065", "language": "", "element": "identifier", "qualifier": "urn", "schema": "dc"}, {"key": "dc.type.ontasot", "value": "Pro gradu -tutkielma", "language": "fi", "element": "type", "qualifier": "ontasot", "schema": "dc"}, {"key": "dc.type.ontasot", "value": "Master\u2019s thesis", "language": "en", "element": "type", "qualifier": "ontasot", "schema": "dc"}, {"key": "dc.contributor.faculty", "value": "Informaatioteknologian tiedekunta", "language": "fi", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": "dc.contributor.faculty", "value": "Faculty of Information Technology", "language": "en", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": "dc.contributor.department", "value": "Informaatioteknologia", "language": "fi", "element": "contributor", "qualifier": "department", "schema": "dc"}, {"key": "dc.contributor.department", "value": "Information Technology", "language": "en", "element": "contributor", "qualifier": "department", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "Jyv\u00e4skyl\u00e4n yliopisto", "language": "fi", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "University of Jyv\u00e4skyl\u00e4", "language": "en", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Tietotekniikka", "language": "fi", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Mathematical Information Technology", "language": "en", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "yvv.contractresearch.funding", "value": "0", "language": "", "element": "contractresearch", "qualifier": "funding", "schema": "yvv"}, {"key": "dc.type.coar", "value": "http://purl.org/coar/resource_type/c_bdcc", "language": null, "element": "type", "qualifier": "coar", "schema": "dc"}, {"key": "dc.rights.accesslevel", "value": "openAccess", "language": null, "element": "rights", "qualifier": "accesslevel", "schema": "dc"}, {"key": "dc.type.publication", "value": "masterThesis", "language": null, "element": "type", "qualifier": "publication", "schema": "dc"}, {"key": "dc.subject.oppiainekoodi", "value": "602", "language": "", "element": "subject", "qualifier": "oppiainekoodi", "schema": "dc"}, {"key": "dc.subject.yso", "value": "tiedonhaku", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "hakuohjelmat", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "koneoppiminen", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "big data", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "kyselykielet", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "tiedonhallinta", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "luonnollinen kieli", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "kieli ja kielet", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "Query", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "tiedonhakuj\u00e4rjestelm\u00e4t", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "information retrieval", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "search engines", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "machine learning", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "big data", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "query languages", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "information management", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "natural language", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "languages", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "Query", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "information retrieval systems", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.format.content", "value": "fulltext", "language": null, "element": "format", "qualifier": "content", "schema": "dc"}, {"key": "dc.rights.url", "value": "https://rightsstatements.org/page/InC/1.0/", "language": null, "element": "rights", "qualifier": "url", "schema": "dc"}, {"key": "dc.type.okm", "value": "G2", "language": null, "element": "type", "qualifier": "okm", "schema": "dc"}, {"key": "dc.description.accessibilityfeature", "value": "unknown accessibility", "language": "en", "element": "description", "qualifier": "accessibilityfeature", "schema": "dc"}, {"key": "dc.description.accessibilityfeature", "value": "ei tietoa saavutettavuudesta", "language": "fi", "element": "description", "qualifier": "accessibilityfeature", "schema": "dc"}]
id	jyx.123456789_73119
language	eng
last_indexed	2025-05-21T20:05:40Z
main_date	2020-01-01T00:00:00Z
main_date_str	2020
online_boolean	1
online_urls_str_mv	{"url":"https:\/\/jyx.jyu.fi\/bitstreams\/6bfabda9-5e92-420a-97ba-b3a4a9842dfa\/download","text":"URN:NBN:fi:jyu-202012117065.pdf","source":"jyx","mediaType":"application\/pdf"}
publishDate	2020
record_format	qdc
source_str_mv	jyx
spellingShingle	Ambaye, Michael Improving search engine results using different machine learning models and tools IBM Watson natural language query Watson discovery Tietotekniikka Mathematical Information Technology 602 tiedonhaku hakuohjelmat koneoppiminen big data kyselykielet tiedonhallinta luonnollinen kieli kieli ja kielet Query tiedonhakujärjestelmät information retrieval search engines machine learning query languages information management natural language languages information retrieval systems
title	Improving search engine results using different machine learning models and tools
title_full	Improving search engine results using different machine learning models and tools
title_fullStr	Improving search engine results using different machine learning models and tools Improving search engine results using different machine learning models and tools
title_full_unstemmed	Improving search engine results using different machine learning models and tools Improving search engine results using different machine learning models and tools
title_short	Improving search engine results using different machine learning models and tools
title_sort	improving search engine results using different machine learning models and tools
title_txtP	Improving search engine results using different machine learning models and tools
topic	IBM Watson natural language query Watson discovery Tietotekniikka Mathematical Information Technology 602 tiedonhaku hakuohjelmat koneoppiminen big data kyselykielet tiedonhallinta luonnollinen kieli kieli ja kielet Query tiedonhakujärjestelmät information retrieval search engines machine learning query languages information management natural language languages information retrieval systems
topic_facet	602 IBM Watson Mathematical Information Technology Query Tietotekniikka Watson discovery big data hakuohjelmat information management information retrieval information retrieval systems kieli ja kielet koneoppiminen kyselykielet languages luonnollinen kieli machine learning natural language natural language query query languages search engines tiedonhaku tiedonhakujärjestelmät tiedonhallinta
url	https://jyx.jyu.fi/handle/123456789/73119 http://www.urn.fi/URN:NBN:fi:jyu-202012117065
work_keys_str_mv	AT ambayemichael improvingsearchengineresultsusingdifferentmachinelearningmodelsandtools

Improving search engine results using different machine learning models and tools

Samankaltaisia teoksia