Automaattisen verkkoharavoinnin menetelmät ja haasteet (Methods and challenges of automated web scraping)

Web scraping is a technique for gathering information from the Internet programmatically, and it can be used for many scientific and commercial purposes. However, web scrapers can face a variety of challenges that may force the developer to update the ...

Bibliographic Details
Main Author: Peltomaa, Olli
Other Authors: Informaatioteknologian tiedekunta, Faculty of Information Technology, Informaatioteknologia, Information Technology, Jyväskylän yliopisto, University of Jyväskylä
Format: Bachelor's thesis
Language: fin
Published: 2023
Subjects: verkkoharavointi (web scraping), CAPTCHA, päätön selain (headless browser)
Online Access: https://jyx.jyu.fi/handle/123456789/86919
_version_ 1826225802488840192
author Peltomaa, Olli
author2 Informaatioteknologian tiedekunta Faculty of Information Technology Informaatioteknologia Information Technology Jyväskylän yliopisto University of Jyväskylä
author_facet Peltomaa, Olli Informaatioteknologian tiedekunta Faculty of Information Technology Informaatioteknologia Information Technology Jyväskylän yliopisto University of Jyväskylä Peltomaa, Olli Informaatioteknologian tiedekunta Faculty of Information Technology Informaatioteknologia Information Technology Jyväskylän yliopisto University of Jyväskylä
author_sort Peltomaa, Olli
datasource_str_mv jyx
description Web scraping is a technique for gathering information from the Internet programmatically, and it can be used for many scientific and commercial purposes. However, web scrapers can face a variety of challenges that may force the developer to update the scraper repeatedly. Based on the literature, headless browsers and machine learning algorithms together produce the scrapers that best tolerate these challenges. The field of web scraping is prone to rapid change, but based on the current literature, algorithms based on machine learning perhaps leave the most room for further research.
first_indexed 2023-05-12T20:00:34Z
format Kandityö
free_online_boolean 1
fullrecord [{"key": "dc.contributor.advisor", "value": "Saksa, Tytti", "language": "", "element": "contributor", "qualifier": "advisor", "schema": "dc"}, {"key": "dc.contributor.author", "value": "Peltomaa, Olli", "language": "", "element": "contributor", "qualifier": "author", "schema": "dc"}, {"key": "dc.date.accessioned", "value": "2023-05-12T05:20:41Z", "language": null, "element": "date", "qualifier": "accessioned", "schema": "dc"}, {"key": "dc.date.available", "value": "2023-05-12T05:20:41Z", "language": null, "element": "date", "qualifier": "available", "schema": "dc"}, {"key": "dc.date.issued", "value": "2023", "language": "", "element": "date", "qualifier": "issued", "schema": "dc"}, {"key": "dc.identifier.uri", "value": "https://jyx.jyu.fi/handle/123456789/86919", "language": null, "element": "identifier", "qualifier": "uri", "schema": "dc"}, {"key": "dc.description.abstract", "value": "Verkkoharavointi on tekniikka, jota k\u00e4ytt\u00e4m\u00e4ll\u00e4 voidaan ker\u00e4t\u00e4 tietoa internetist\u00e4 ohjelmallisesti ja sit\u00e4 voidaan hy\u00f6dynt\u00e4\u00e4 moniin tieteellisiin ja kaupallisiin tarkoituksiin. Verkkoharavointiohjelmat voivat kuitenkin kohdata monenlaisia haasteita, jotka saattavat pakottaa kehitt\u00e4j\u00e4n p\u00e4ivitt\u00e4m\u00e4\u00e4n haravointiohjelmaa toistuvasti. Kirjallisuuden perusteella k\u00e4ytt\u00f6liittym\u00e4tt\u00f6m\u00e4t selaimet ja koneoppimisalgoritmit tuottavat yhdess\u00e4 parhaiten erilaisia haasteita siet\u00e4v\u00e4n ohjelman. Verkkoharavoinnin ala on altis nopeille muutoksille, mutta nykyisen kirjallisuuden perusteella koneoppimiseen perustuvissa algoritmeissa on kenties eniten tutkittavaa.", "language": "fi", "element": "description", "qualifier": "abstract", "schema": "dc"}, {"key": "dc.description.abstract", "value": "Web scraping is a technique that can be used to gather information from the Internet programmatically and it can be used for many scientific and commercial purposes. 
However, web scrapers can face a variety of challenges that may force the developer to update the scraper repeatedly. Based on the literature, headless browsers and machine learning algorithms together produce the best scrapers that tolerates different challenges. The field of web scraping is prone to rapid changes, but based on the current literature, algorithms based on machine learning have perhaps the most research to do.", "language": "en", "element": "description", "qualifier": "abstract", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Submitted by Paivi Vuorio (paelvuor@jyu.fi) on 2023-05-12T05:20:41Z\nNo. of bitstreams: 0", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Made available in DSpace on 2023-05-12T05:20:41Z (GMT). No. of bitstreams: 0\n Previous issue date: 2023", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.format.extent", "value": "23", "language": "", "element": "format", "qualifier": "extent", "schema": "dc"}, {"key": "dc.language.iso", "value": "fin", "language": null, "element": "language", "qualifier": "iso", "schema": "dc"}, {"key": "dc.rights", "value": "In Copyright", "language": "en", "element": "rights", "qualifier": null, "schema": "dc"}, {"key": "dc.subject.other", "value": "verkkoharavointi", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "CAPTCHA", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "p\u00e4\u00e4t\u00f6n selain", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.title", "value": "Automaattisen verkkoharavoinnin menetelm\u00e4t ja haasteet", "language": "", "element": "title", "qualifier": null, "schema": "dc"}, {"key": "dc.type", "value": "bachelor thesis", "language": null, "element": 
"type", "qualifier": null, "schema": "dc"}, {"key": "dc.identifier.urn", "value": "URN:NBN:fi:jyu-202305122996", "language": "", "element": "identifier", "qualifier": "urn", "schema": "dc"}, {"key": "dc.type.ontasot", "value": "Bachelor's thesis", "language": "en", "element": "type", "qualifier": "ontasot", "schema": "dc"}, {"key": "dc.type.ontasot", "value": "Kandidaatinty\u00f6", "language": "fi", "element": "type", "qualifier": "ontasot", "schema": "dc"}, {"key": "dc.contributor.faculty", "value": "Informaatioteknologian tiedekunta", "language": "fi", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": "dc.contributor.faculty", "value": "Faculty of Information Technology", "language": "en", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": "dc.contributor.department", "value": "Informaatioteknologia", "language": "fi", "element": "contributor", "qualifier": "department", "schema": "dc"}, {"key": "dc.contributor.department", "value": "Information Technology", "language": "en", "element": "contributor", "qualifier": "department", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "Jyv\u00e4skyl\u00e4n yliopisto", "language": "fi", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "University of Jyv\u00e4skyl\u00e4", "language": "en", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Tietotekniikka", "language": "fi", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Mathematical Information Technology", "language": "en", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "yvv.contractresearch.funding", "value": "0", "language": "", "element": "contractresearch", "qualifier": "funding", "schema": "yvv"}, {"key": "dc.type.coar", "value": "http://purl.org/coar/resource_type/c_7a1f", 
"language": null, "element": "type", "qualifier": "coar", "schema": "dc"}, {"key": "dc.rights.accesslevel", "value": "openAccess", "language": null, "element": "rights", "qualifier": "accesslevel", "schema": "dc"}, {"key": "dc.type.publication", "value": "bachelorThesis", "language": null, "element": "type", "qualifier": "publication", "schema": "dc"}, {"key": "dc.subject.oppiainekoodi", "value": "602", "language": "", "element": "subject", "qualifier": "oppiainekoodi", "schema": "dc"}, {"key": "dc.subject.yso", "value": "WWW-sivut", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "Internet", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "tietotekniikka", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "tiedonhaku", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.rights.url", "value": "https://rightsstatements.org/page/InC/1.0/", "language": null, "element": "rights", "qualifier": "url", "schema": "dc"}]
id jyx.123456789_86919
language fin
last_indexed 2025-02-18T10:54:34Z
main_date 2023-01-01T00:00:00Z
main_date_str 2023
online_boolean 1
online_urls_str_mv {"url":"https:\/\/jyx.jyu.fi\/bitstreams\/bf850555-efa2-460f-bcbd-adeaeffc9e97\/download","text":"URN:NBN:fi:jyu-202305122996.pdf","source":"jyx","mediaType":"application\/pdf"}
publishDate 2023
record_format qdc
source_str_mv jyx
spellingShingle Peltomaa, Olli Automaattisen verkkoharavoinnin menetelmät ja haasteet verkkoharavointi CAPTCHA päätön selain Tietotekniikka Mathematical Information Technology 602 WWW-sivut Internet tietotekniikka tiedonhaku
title Automaattisen verkkoharavoinnin menetelmät ja haasteet
title_full Automaattisen verkkoharavoinnin menetelmät ja haasteet
title_fullStr Automaattisen verkkoharavoinnin menetelmät ja haasteet Automaattisen verkkoharavoinnin menetelmät ja haasteet
title_full_unstemmed Automaattisen verkkoharavoinnin menetelmät ja haasteet Automaattisen verkkoharavoinnin menetelmät ja haasteet
title_short Automaattisen verkkoharavoinnin menetelmät ja haasteet
title_sort automaattisen verkkoharavoinnin menetelmät ja haasteet
title_txtP Automaattisen verkkoharavoinnin menetelmät ja haasteet
topic verkkoharavointi CAPTCHA päätön selain Tietotekniikka Mathematical Information Technology 602 WWW-sivut Internet tietotekniikka tiedonhaku
topic_facet 602 CAPTCHA Internet Mathematical Information Technology Tietotekniikka WWW-sivut päätön selain tiedonhaku tietotekniikka verkkoharavointi
url https://jyx.jyu.fi/handle/123456789/86919 http://www.urn.fi/URN:NBN:fi:jyu-202305122996
work_keys_str_mv AT peltomaaolli automaattisenverkkoharavoinninmenetelmätjahaasteet