Part-of-speech tagging in written slang

Language technology applications have been in everyday use for decades. Predictive text input and automatic correction, for example, have been in use for decades. Speech recognition and machine translation are somewhat newer applications. As a discipline...

Bibliographic Details
Main Author: Korolainen, Valtteri
Other Authors: Informaatioteknologian tiedekunta, Faculty of Information Technology, Tietojenkäsittelytieteiden laitos, Department of Computer Science and Information Systems, University of Jyväskylä, Jyväskylän yliopisto
Format: Master's thesis
Language: eng
Published: 2014
Subjects:
Online Access: https://jyx.jyu.fi/handle/123456789/44127
Description (abstract, translated from Finnish): Language technology applications have been in everyday use for decades; predictive text input and automatic correction are examples. Speech recognition and machine translation are somewhat newer applications. As a discipline, language technology is decades old, yet machines still often struggle to understand natural languages. The aim of this study is to survey how well machines can annotate text automatically when the material contains slang. The study includes an empirical experiment on the performance of automatic annotation algorithms. Language processing is also relatively heavy with the algorithms currently in use, although some applications can be run in cloud services. European languages are processed reasonably well by current algorithms compared with many other languages, owing to considerably more extensive groundwork. Although many applications often succeed in understanding standard natural language, processing slang is considerably harder. The main reasons are the scarcity of slang research related to language technology and the often more complex nature of slang. Automatic simultaneous interpretation is already possible to some extent with modern language technology applications. One way to evaluate a given language technology is to analyze the underlying part-of-speech tagger, whose task is to annotate text fragments. The research problem of this study is to determine how the n-gram algorithm performs relative to other algorithms in use when annotating slang. With statistical approaches, the extent of the underlying manual tagging also significantly affects the tagger's performance. European languages can often be processed more reliably with statistical methods, whereas, for example, South Indian languages, such as Hindi, are often processed more reliably with rule-based methods. English in its standard form can be annotated automatically with 97% accuracy, whereas automatic annotation of English slang reaches only 93% accuracy. The results show that although the choice of algorithm contributes to annotation accuracy, rule-based methods are an important addition in slang annotation. The most important rule-based supplementary method is word clustering.

Description (abstract, English): Contemporary computers vary in their ability to process natural languages. Speech recognition and machine translation, for example, both stem from the study of natural language processing (NLP). Still, machines have problems understanding natural language because words can be ambiguous. Most of the time machines are able to understand single words; complete sentences cause more problems. In addition, part of the actual language processing has moved from local machines to the cloud because of heavy algorithms with high time or space complexity. English and other European languages have better success rates in NLP solutions than other languages, mainly because of the amount of work and prior analysis done on them. Although a variety of NLP solutions exists, they mainly focus on standard language. Our research contains an empirical study whose goal is to describe the suitability of n-gram algorithms for automatic slang annotation. Slang processing is more problematic than processing standard language, as its lower accuracy rates show. Some of the problems are caused by the lack of extensive slang analysis, while others are due to the complexity of slang. A simultaneous interpreter is one possible upcoming NLP innovation, but it has limitations because slang processing is still partly under development. One way to analyze the linguistic capabilities of a machine is to evaluate the success rate of part-of-speech (POS) tagging. The research problem is how n-gram algorithms perform in slang tagging compared to previously tested algorithms. This study finds that the choice of tagging algorithm plays a major part in tagger accuracy. In statistical approaches, corpus size also affects accuracy markedly. Languages perform differently with different algorithms. For instance, statistical tagging algorithms mostly achieve better accuracy for European languages, while rule-based tagging algorithms outperform statistical taggers for South Indian languages. From the POS tagging point of view, English slang can be considered a different language from Standard English. While Standard English text can be automatically tagged with a success rate of 97%, slang taggers only just reach a 93% success rate. In conclusion, rule-based approaches are an important addition to slang POS taggers. The most important of these tools is word clustering.
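To make the abstract's terms concrete, the following is a minimal sketch of an n-gram POS tagger and of the "success rate" (token-level accuracy) used to evaluate it. It is not the thesis's implementation: the toy corpus, the tag set, and the names `TRAIN`, `HELD_OUT`, `train`, `tag`, and `accuracy` are all hypothetical, and the backoff scheme (bigram, then unigram, then the most frequent tag) is just one common way to build such a tagger.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each sentence is a list of (token, tag) pairs.
TRAIN = [
    [("i", "PRON"), ("am", "VERB"), ("gonna", "VERB"), ("run", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("lit", "ADJ")],
    [("u", "PRON"), ("run", "VERB"), ("fast", "ADV")],
]

# Hypothetical held-out sentence containing an unseen slang token.
HELD_OUT = [[("u", "PRON"), ("gonna", "VERB"), ("yolo", "NOUN")]]


def train(sentences):
    """Count tags per word (unigram) and per (previous tag, word) pair (bigram)."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        for word, pos in sentence:
            unigram[word][pos] += 1
            bigram[(prev, word)][pos] += 1
            tag_counts[pos] += 1
            prev = pos
    default = tag_counts.most_common(1)[0][0]  # fallback for unknown words
    return unigram, bigram, default


def tag(words, model):
    """Tag left to right, backing off from bigram to unigram to the default tag."""
    unigram, bigram, default = model
    prev, tags = "<s>", []
    for word in words:
        if (prev, word) in bigram:
            pos = bigram[(prev, word)].most_common(1)[0][0]
        elif word in unigram:
            pos = unigram[word].most_common(1)[0][0]
        else:
            pos = default
        tags.append(pos)
        prev = pos
    return tags


def accuracy(gold_sentences, model):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        predicted = tag([word for word, _ in sentence], model)
        for (_, gold), pred in zip(sentence, predicted):
            correct += gold == pred
            total += 1
    return correct / total


if __name__ == "__main__":
    model = train(TRAIN)
    print(tag(["the", "run"], model))
    print(accuracy(HELD_OUT, model))
```

In this sketch the unseen token falls through to the most frequent tag, which illustrates why out-of-vocabulary slang lowers accuracy and why corpus size matters for statistical taggers.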
Thesis type: Pro gradu -tutkielma (Master's thesis)
Discipline: Tietojenkäsittelytiede (Computer Science); subject code 601
Topics: Part-of-Speech tagging; Hidden-Markov Model; Natural Language Processing; Algorithms; Machine Learning; Language Technologies; kieliteknologia (language technology); koneoppiminen (machine learning); algoritmit (algorithms); luonnollinen kieli (natural language)
Rights: In Copyright (https://rightsstatements.org/page/InC/1.0/); open access
Extent: 1 online resource (application/pdf)
Record ID: jyx.123456789_44127
URN: URN:NBN:fi:jyu-201408282684 (http://www.urn.fi/URN:NBN:fi:jyu-201408282684)
Full text: https://jyx.jyu.fi/bitstreams/7479029d-b5e0-4946-90e2-89f60defd212/download (URN:NBN:fi:jyu-201408282684.pdf)
Deposited: 2014-08-28
First indexed: 2024-09-11; last indexed: 2025-02-18