Evaluating the Effectiveness of LLMs and Prompting Techniques in Generating Data Quality Rules

This thesis explores the use of generative artificial intelligence (GenAI), specifically large language models (LLMs), to automate the creation of data quality (DQ) rules. Traditional rule-based systems are difficult to scale in large and dynamic data environments. To address this, the study evaluat...

Full details

Bibliographic details
Main author: Siyam, Sohag
Other authors: Informaatioteknologian tiedekunta, Faculty of Information Technology, Jyväskylän yliopisto, University of Jyväskylä
Material type: Master's thesis (pro gradu)
Language: English
Published: 2025
Subjects:
Links: https://jyx.jyu.fi/handle/123456789/102965
_version_ 1834494319148400640
author Siyam, Sohag
author2 Informaatioteknologian tiedekunta Faculty of Information Technology Jyväskylän yliopisto University of Jyväskylä
datasource_str_mv jyx
description This thesis explores the use of generative artificial intelligence (GenAI), specifically large language models (LLMs), to automate the creation of data quality (DQ) rules. Traditional rule-based systems are difficult to scale in large and dynamic data environments. To address this, the study evaluates three LLMs: GPT-4 Turbo, Gemini 1.5 Pro, and Claude 3.7 Sonnet, using three prompting strategies: zero-shot, few-shot, and prompt-chaining. A total of 216 rule sets were generated from metadata and profiling inputs and evaluated by domain experts. Results show that prompt-chaining significantly improves rule quality over standalone prompting strategies, while model choice has a minor impact. The best-performing combination (Claude with prompt-chaining) achieved high-quality outputs. These findings demonstrate that GenAI can support scalable and adaptive DQ rule generation when paired with effective prompt design, offering a practical solution for enterprise data monitoring.
first_indexed 2025-06-02T20:00:55Z
format Pro gradu
fullrecord [{"key": "dc.contributor.advisor", "value": "Khriyenko, Oleksiy", "language": null, "element": "contributor", "qualifier": "advisor", "schema": "dc"}, {"key": "dc.contributor.author", "value": "Siyam, Sohag", "language": null, "element": "contributor", "qualifier": "author", "schema": "dc"}, {"key": "dc.date.accessioned", "value": "2025-06-02T12:06:50Z", "language": null, "element": "date", "qualifier": "accessioned", "schema": "dc"}, {"key": "dc.date.available", "value": "2025-06-02T12:06:50Z", "language": null, "element": "date", "qualifier": "available", "schema": "dc"}, {"key": "dc.date.issued", "value": "2025", "language": null, "element": "date", "qualifier": "issued", "schema": "dc"}, {"key": "dc.identifier.uri", "value": "https://jyx.jyu.fi/handle/123456789/102965", "language": null, "element": "identifier", "qualifier": "uri", "schema": "dc"}, {"key": "dc.description.abstract", "value": "This thesis explores the use of generative artificial intelligence (GenAI), specifically large\nlanguage models (LLMs), to automate the creation of data quality (DQ) rules. Traditional\nrule-based systems are difficult to scale in large and dynamic data environments. To\naddress this, the study evaluates three LLMs: GPT-4 Turbo, Gemini 1.5 Pro, and Claude\n3.7 Sonnet, using three prompting strategies: zero-shot, few-shot, and prompt-chaining.\nA total of 216 rule sets were generated from metadata and profiling inputs and evaluated\nby domain experts. Results show that prompt-chaining significantly improves rule quality\nover standalone prompting strategies, while model choice has a minor impact. 
The\nbest-performing combination (Claude with prompt-chaining) achieved high-quality\noutputs.\nThese findings demonstrate that GenAI can support scalable and adaptive DQ rule\ngeneration when paired with effective prompt design, offering a practical solution for\nenterprise data monitoring.", "language": "en", "element": "description", "qualifier": "abstract", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Submitted by jyx lomake-julkaisija (jyx-julkaisija.group@korppi.jyu.fi) on 2025-06-02T12:06:50Z\nNo. of bitstreams: 0", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Made available in DSpace on 2025-06-02T12:06:50Z (GMT). No. of bitstreams: 0", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.format.extent", "value": "37", "language": null, "element": "format", "qualifier": "extent", "schema": "dc"}, {"key": "dc.format.mimetype", "value": "application/pdf", "language": null, "element": "format", "qualifier": "mimetype", "schema": "dc"}, {"key": "dc.language.iso", "value": "eng", "language": null, "element": "language", "qualifier": "iso", "schema": "dc"}, {"key": "dc.rights", "value": "In Copyright", "language": null, "element": "rights", "qualifier": null, "schema": "dc"}, {"key": "dc.title", "value": "Evaluating the Effectiveness of LLMs and Prompting Techniques in Generating Data Quality Rules", "language": null, "element": "title", "qualifier": null, "schema": "dc"}, {"key": "dc.type", "value": "master thesis", "language": null, "element": "type", "qualifier": null, "schema": "dc"}, {"key": "dc.identifier.urn", "value": "URN:NBN:fi:jyu-202506024774", "language": null, "element": "identifier", "qualifier": "urn", "schema": "dc"}, {"key": "dc.contributor.faculty", "value": "Informaatioteknologian tiedekunta", "language": "fi", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": 
"dc.contributor.faculty", "value": "Faculty of Information Technology", "language": "en", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "Jyv\u00e4skyl\u00e4n yliopisto", "language": "fi", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "University of Jyv\u00e4skyl\u00e4", "language": "en", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Master's Degree Programme in Artificial Intelligence", "language": "fi", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Master's Degree Programme in Artificial Intelligence", "language": "en", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "dc.type.coar", "value": "http://purl.org/coar/resource_type/c_bdcc", "language": null, "element": "type", "qualifier": "coar", "schema": "dc"}, {"key": "dc.rights.copyright", "value": "\u00a9 The Author(s)", "language": null, "element": "rights", "qualifier": "copyright", "schema": "dc"}, {"key": "dc.rights.accesslevel", "value": "restrictedAccess", "language": null, "element": "rights", "qualifier": "accesslevel", "schema": "dc"}, {"key": "dc.type.publication", "value": "masterThesis", "language": null, "element": "type", "qualifier": "publication", "schema": "dc"}, {"key": "dc.format.content", "value": "fulltext", "language": null, "element": "format", "qualifier": "content", "schema": "dc"}, {"key": "dc.rights.url", "value": "https://rightsstatements.org/page/InC/1.0/", "language": null, "element": "rights", "qualifier": "url", "schema": "dc"}, {"key": "dc.rights.accessrights", "value": "Tekij\u00e4 ei ole antanut lupaa avoimeen julkaisuun, joten aineisto on luettavissa vain Jyv\u00e4skyl\u00e4n yliopiston kirjaston arkistoty\u00f6semalta. Ks. 
https://www.jyu.fi/fi/osc/kirjasto/tyoskentelytilat/laitteet-ja-tilat#toc-jyx-ty-asema.", "language": "fi", "element": "rights", "qualifier": "accessrights", "schema": "dc"}, {"key": "dc.rights.accessrights", "value": "The author has not given permission to make the work publicly available electronically. Therefore the material can be read only at the archival workstation at Jyv\u00e4skyl\u00e4 University Library (https://www.jyu.fi/en/osc/library/workspaces/facilities-and-equipment#toc-jyx-workstation).", "language": "en", "element": "rights", "qualifier": "accessrights", "schema": "dc"}, {"key": "dc.description.accessibilityfeature", "value": "ei tietoa saavutettavuudesta", "language": "fi", "element": "description", "qualifier": "accessibilityfeature", "schema": "dc"}, {"key": "dc.description.accessibilityfeature", "value": "unknown accessibility", "language": "en", "element": "description", "qualifier": "accessibilityfeature", "schema": "dc"}]
id jyx.123456789_102965
language eng
last_indexed 2025-06-02T20:02:29Z
main_date 2025-01-01T00:00:00Z
main_date_str 2025
publishDate 2025
record_format qdc
title Evaluating the Effectiveness of LLMs and Prompting Techniques in Generating Data Quality Rules
topic Master's Degree Programme in Artificial Intelligence
url https://jyx.jyu.fi/handle/123456789/102965 http://www.urn.fi/URN:NBN:fi:jyu-202506024774