Approaches and challenges of automatic vulnerability classification using natural language processing and machine learning techniques

Automated vulnerability discovery and prediction of vulnerability details can help specialists prioritize software defects, which can lead to faster fixes. This thesis used the National Vulnerability Database to study how vulnerabil...

Bibliographic Details
Main Author: Jormakka, Ossi
Other Authors: Informaatioteknologian tiedekunta, Faculty of Information Technology, Informaatioteknologia, Information Technology, Jyväskylän yliopisto, University of Jyväskylä
Format: Master's thesis
Language: eng
Published: 2019
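Subjects: common vulnerability scoring system, common weakness enumeration, Scikit-learn, machine learning, classification, vulnerability, data science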
Online Access: https://jyx.jyu.fi/handle/123456789/66196
_version_ 1826225755506343936
author Jormakka, Ossi
author2 Informaatioteknologian tiedekunta Faculty of Information Technology Informaatioteknologia Information Technology Jyväskylän yliopisto University of Jyväskylä
author_facet Jormakka, Ossi Informaatioteknologian tiedekunta Faculty of Information Technology Informaatioteknologia Information Technology Jyväskylän yliopisto University of Jyväskylä
author_sort Jormakka, Ossi
datasource_str_mv jyx
description Automated vulnerability discovery and prediction of vulnerability details can help specialists prioritize software defects, which can lead to faster fixes. This thesis used the National Vulnerability Database to study how vulnerability descriptions can be used to detect vulnerabilities in arbitrary text and to predict vulnerability severity and weakness type. The Common Vulnerability Scoring System provides a way to measure vulnerability severity. The Common Weakness Enumeration provides a hierarchical classification of common weakness types. Existing research on text classification of vulnerabilities is often limited to a narrow area, for example a single version of the Common Vulnerability Scoring System. This thesis gives an overview of classifying bug reports and of predicting severity and weakness type. The work aimed to make broad use of well-known text preprocessing methods and of many other natural language processing options and machine learning methods offered by the Scikit-learn library. The results show that a keyword-based method using 2-grams is as effective as a one-class support vector machine when Term Frequency–Inverse Document Frequency weighting and lemmatization are used as preprocessing. Severity prediction gives better results for version 2 of the Common Vulnerability Scoring System than for version 3. A linear support vector machine achieved the highest F1-score in classifying vulnerability severity and weakness type. The thesis also includes a summary of the latest data in the National Vulnerability Database.
Automated vulnerability detection and prediction of vulnerability details may help security specialists prioritize bug reports and get earlier fixes for security-related software defects. This thesis is about finding vulnerability-like descriptions in arbitrary text and classifying vulnerability severities and weakness types. Vulnerability severities are measured using the Common Vulnerability Scoring System. The Common Weakness Enumeration is a hierarchical list of weakness types into which each vulnerability can be classified. The scoring and weakness type information for known vulnerabilities is available in the National Vulnerability Database. Much existing research on text-only vulnerability classification is limited to a narrow area, for example a specific version of the Common Vulnerability Scoring System. This thesis gives an overview of classifying bug reports together with severities and weakness types. The Scikit-learn library's interfaces were used extensively to implement text preprocessing, machine learning classification, and experiment validation. Experiments include stemming, lemmatization, and numerous text vectorization options and algorithms provided by the library. The results show that, in vulnerability detection, a keyword-based classifier using word 2-grams works as well as a One-class Support Vector Machine with lemmatization and Term Frequency–Inverse Document Frequency preprocessing. Vulnerability severities can be predicted better for Common Vulnerability Scoring System version 2 than for version 3. The Linear Support Vector Machine classifier achieved the highest F1-score in predicting both Common Vulnerability Scoring System severities and Common Weakness Enumeration types. This thesis also presents a summary of the latest data available in the National Vulnerability Database data feeds.
first_indexed 2019-11-06T21:03:06Z
format Pro gradu
free_online_boolean 1
fullrecord [{"key": "dc.contributor.advisor", "value": "Costin, Andrei", "language": "", "element": "contributor", "qualifier": "advisor", "schema": "dc"}, {"key": "dc.contributor.author", "value": "Jormakka, Ossi", "language": "", "element": "contributor", "qualifier": "author", "schema": "dc"}, {"key": "dc.date.accessioned", "value": "2019-11-06T06:51:08Z", "language": null, "element": "date", "qualifier": "accessioned", "schema": "dc"}, {"key": "dc.date.available", "value": "2019-11-06T06:51:08Z", "language": null, "element": "date", "qualifier": "available", "schema": "dc"}, {"key": "dc.date.issued", "value": "2019", "language": "", "element": "date", "qualifier": "issued", "schema": "dc"}, {"key": "dc.identifier.uri", "value": "https://jyx.jyu.fi/handle/123456789/66196", "language": null, "element": "identifier", "qualifier": "uri", "schema": "dc"}, {"key": "dc.description.abstract", "value": "Automatisoitu haavoittuvuuksien etsiminen ja haavoittuvuuksien yksityiskohtien ennustaminen voi auttaa asiantuntijoita priorisoimaan ohjelmistovirheit\u00e4,\njoka voi johtaa nopeampaan virheenkorjaukseen. T\u00e4ss\u00e4 ty\u00f6ss\u00e4 k\u00e4ytettiin National Vulnerability Database -tietokantaa tutkittaessa kuinka haavoittuvuuskuvauksien perusteella voidaan havaita haavoittuvuuksia mist\u00e4 tahansa tekstist\u00e4\nsek\u00e4 ennustaa haavoittuvuuksien vakavuus ja haavoittuvuustyyppi. Common\nVulnerability Scoring System -j\u00e4rjestelm\u00e4 tarjoaa tavan mitata haavoittuvuuksien vakavuuksia. Common Weakness Enumeration -j\u00e4rjestelm\u00e4 tarjoaa hierarkkisen luokittelun yleisiin haavoittuvuustyyppeihin. Olemassa olevat tutkimukset haavoittuvuuksien tekstiluokittelussa usein rajoittuvat kapeaan alueeseen,\nesimerkiksi vain johonkin Common Vulnerability Scoring System -j\u00e4rjestelm\u00e4n\nversioon. T\u00e4m\u00e4 ty\u00f6 antaa yleiskuvan virheraporttien luokittelusta sek\u00e4 vakavuuden ja haavoittuvuustyypin ennustamisesta. 
Ty\u00f6ss\u00e4 pyrittiin k\u00e4ytt\u00e4m\u00e4\u00e4n laajasti\ntunnettuja tekstin esik\u00e4sittelymenetelmi\u00e4 sek\u00e4 monia muita Scikit-learn -kirjaston tarjoamia luonnollisen tekstin k\u00e4sittelyn vaihtoehtoja ja koneoppimismenetelmi\u00e4.\nTulokset osoittavat 2-grammin avainsanapohjaisen menetelm\u00e4n olevan\nyht\u00e4 tehokas kuin yhden luokan tukivektorikone kun esik\u00e4sittelyn\u00e4 k\u00e4ytet\u00e4\u00e4n\nTerm Frequency \u2013 Inverse Document Frequency -painotusta ja sanojen taivutusmuotojen muuttamista perusmuotoon (lemmatizing). Haavoittuvuuksien vakavuuden ennustamisessa saadaan parempia tuloksia Common Vulnerability Scoring System -j\u00e4rjestelm\u00e4n versiolle 2 kuin j\u00e4rjestelm\u00e4n versiolle 3. Lineaarinen\ntukivekorikone saavutti korkeimman F1-tuloksen haavoittuvuuksien vakavuuden ja haavoittuvuustyypin luokittelussa. Lis\u00e4ksi t\u00e4ss\u00e4 ty\u00f6ss\u00e4 on yhteenveto uusimpaan National Vulnerability Database -tietokannan tietoon.", "language": "fi", "element": "description", "qualifier": "abstract", "schema": "dc"}, {"key": "dc.description.abstract", "value": "Automated vulnerability detection and prediction of vulnerability details may\nhelp security specialists to prioritize bug reports and getting earlier fixes to\nsecurity related software defects. This thesis is about finding vulnerable-like\ndescriptions from any text and classifying vulnerability severities and weakness\ntypes. Vulnerability severities are measured using Common Vulnerability\nScoring System. Common Weakness Enumeration is a hierarchical list of\nweakness types that each vulnerability can be classified to. The scoring and\nweakness type information for known vulnerabilities are available on National\nVulnerability Database. Many existing research about vulnerability text-only\nclassification is limited to a narrow area, for example, specific version of\nCommon Vulnerability Scoring System. 
This thesis gives an overview of\nclassifying bug reports with severities and weakness types altogether. The Scikitlearn library\u2019s interfaces were used extensively to implement text preprocessing,\nmachine learning classification, and experiment validation. Experiments include\nstemming, lemmatization, and numerous text vectorization options and\nalgorithms provided by the library.\nThe results show that the keyword-based classifier using word 2-grams\nworks as well as One-class Support Vector Machine with lemmatizing using the\nTerm Frequency\u2013Inverse Document Frequency preprocessing method in\nvulnerability detection. Vulnerability severities can be predicted better for\nCommon Vulnerability Scoring System version 2 than its version 3. The Linear\nSupport Vector Machine classifier got the highest F1-score in predicting both\nCommon Vulnerability Scoring System and Common Weakness Enumeration.\nThis thesis also presents a summary on the latest data available on the National\nVulnerability Database data feeds.", "language": "en", "element": "description", "qualifier": "abstract", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Submitted by Paivi Vuorio (paelvuor@jyu.fi) on 2019-11-06T06:51:07Z\nNo. of bitstreams: 0", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Made available in DSpace on 2019-11-06T06:51:08Z (GMT). No. 
of bitstreams: 0\n Previous issue date: 2019", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.format.extent", "value": "62", "language": "", "element": "format", "qualifier": "extent", "schema": "dc"}, {"key": "dc.format.mimetype", "value": "application/pdf", "language": null, "element": "format", "qualifier": "mimetype", "schema": "dc"}, {"key": "dc.language.iso", "value": "eng", "language": null, "element": "language", "qualifier": "iso", "schema": "dc"}, {"key": "dc.rights", "value": "In Copyright", "language": "en", "element": "rights", "qualifier": null, "schema": "dc"}, {"key": "dc.subject.other", "value": "common vulnerability scoring system", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "common weakness enumeration", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "Scikit-learn", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.title", "value": "Approaches and challenges of automatic vulnerability classification using natural language processing and machine learning techniques", "language": "", "element": "title", "qualifier": null, "schema": "dc"}, {"key": "dc.type", "value": "master thesis", "language": null, "element": "type", "qualifier": null, "schema": "dc"}, {"key": "dc.identifier.urn", "value": "URN:NBN:fi:jyu-201911064740", "language": "", "element": "identifier", "qualifier": "urn", "schema": "dc"}, {"key": "dc.type.ontasot", "value": "Pro gradu -tutkielma", "language": "fi", "element": "type", "qualifier": "ontasot", "schema": "dc"}, {"key": "dc.type.ontasot", "value": "Master\u2019s thesis", "language": "en", "element": "type", "qualifier": "ontasot", "schema": "dc"}, {"key": "dc.contributor.faculty", "value": "Informaatioteknologian tiedekunta", "language": "fi", "element": "contributor", "qualifier": "faculty", 
"schema": "dc"}, {"key": "dc.contributor.faculty", "value": "Faculty of Information Technology", "language": "en", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": "dc.contributor.department", "value": "Informaatioteknologia", "language": "fi", "element": "contributor", "qualifier": "department", "schema": "dc"}, {"key": "dc.contributor.department", "value": "Information Technology", "language": "en", "element": "contributor", "qualifier": "department", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "Jyv\u00e4skyl\u00e4n yliopisto", "language": "fi", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "University of Jyv\u00e4skyl\u00e4", "language": "en", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Tietojenk\u00e4sittelytiede", "language": "fi", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Computer Science", "language": "en", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "yvv.contractresearch.funding", "value": "0", "language": "", "element": "contractresearch", "qualifier": "funding", "schema": "yvv"}, {"key": "dc.type.coar", "value": "http://purl.org/coar/resource_type/c_bdcc", "language": null, "element": "type", "qualifier": "coar", "schema": "dc"}, {"key": "dc.rights.accesslevel", "value": "openAccess", "language": null, "element": "rights", "qualifier": "accesslevel", "schema": "dc"}, {"key": "dc.type.publication", "value": "masterThesis", "language": null, "element": "type", "qualifier": "publication", "schema": "dc"}, {"key": "dc.subject.oppiainekoodi", "value": "601", "language": "", "element": "subject", "qualifier": "oppiainekoodi", "schema": "dc"}, {"key": "dc.subject.yso", "value": "koneoppiminen", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": 
"dc.subject.yso", "value": "luokitus (toiminta)", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "haavoittuvuus", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "datatiede", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "machine learning", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "classification", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "vulnerability", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "data science", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.format.content", "value": "fulltext", "language": null, "element": "format", "qualifier": "content", "schema": "dc"}, {"key": "dc.rights.url", "value": "https://rightsstatements.org/page/InC/1.0/", "language": null, "element": "rights", "qualifier": "url", "schema": "dc"}, {"key": "dc.type.okm", "value": "G2", "language": null, "element": "type", "qualifier": "okm", "schema": "dc"}]
id jyx.123456789_66196
language eng
last_indexed 2025-02-18T10:54:10Z
main_date 2019-01-01T00:00:00Z
main_date_str 2019
online_boolean 1
online_urls_str_mv {"url":"https:\/\/jyx.jyu.fi\/bitstreams\/80a1688d-d1ef-4858-a299-c142496e078d\/download","text":"URN:NBN:fi:jyu-201911064740.pdf","source":"jyx","mediaType":"application\/pdf"}
publishDate 2019
record_format qdc
source_str_mv jyx
spellingShingle Jormakka, Ossi Approaches and challenges of automatic vulnerability classification using natural language processing and machine learning techniques common vulnerability scoring system common weakness enumeration Scikit-learn Tietojenkäsittelytiede Computer Science 601 koneoppiminen luokitus (toiminta) haavoittuvuus datatiede machine learning classification vulnerability data science
title Approaches and challenges of automatic vulnerability classification using natural language processing and machine learning techniques
title_full Approaches and challenges of automatic vulnerability classification using natural language processing and machine learning techniques
title_fullStr Approaches and challenges of automatic vulnerability classification using natural language processing and machine learning techniques
title_full_unstemmed Approaches and challenges of automatic vulnerability classification using natural language processing and machine learning techniques
title_short Approaches and challenges of automatic vulnerability classification using natural language processing and machine learning techniques
title_sort approaches and challenges of automatic vulnerability classification using natural language processing and machine learning techniques
title_txtP Approaches and challenges of automatic vulnerability classification using natural language processing and machine learning techniques
topic common vulnerability scoring system common weakness enumeration Scikit-learn Tietojenkäsittelytiede Computer Science 601 koneoppiminen luokitus (toiminta) haavoittuvuus datatiede machine learning classification vulnerability data science
topic_facet 601 Computer Science Scikit-learn Tietojenkäsittelytiede classification common vulnerability scoring system common weakness enumeration data science datatiede haavoittuvuus koneoppiminen luokitus (toiminta) machine learning vulnerability
url https://jyx.jyu.fi/handle/123456789/66196 http://www.urn.fi/URN:NBN:fi:jyu-201911064740
work_keys_str_mv AT jormakkaossi approachesandchallengesofautomaticvulnerabilityclassificationusingnaturallanguagepr
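
The abstract in this record describes classifying vulnerability descriptions with Term Frequency–Inverse Document Frequency features and a Linear Support Vector Machine, scored by F1. The following is a minimal sketch of that kind of pipeline in Scikit-learn, not the thesis's actual code. The toy descriptions, the severity labels, the `build_classifier` helper, and the choice of word 1-grams plus 2-grams are all assumptions made for illustration; lemmatization is omitted because Scikit-learn does not provide it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data. Real input would be vulnerability descriptions
# with severity labels taken from the National Vulnerability Database.
train_texts = [
    "Buffer overflow allows remote attackers to execute arbitrary code",
    "SQL injection allows remote attackers to execute arbitrary SQL commands",
    "Heap overflow allows remote code execution via a crafted packet",
    "Cross-site scripting allows attackers to inject arbitrary web script",
    "Information disclosure lets local users read a log file",
    "Local users can cause a minor denial of service via a crafted file",
]
train_labels = ["HIGH", "HIGH", "HIGH", "MEDIUM", "LOW", "LOW"]


def build_classifier():
    """Return a TF-IDF + linear SVM pipeline (illustrative settings only)."""
    return make_pipeline(
        # Word 1-grams and 2-grams, lowercased, weighted by TF-IDF.
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
        # Linear support vector machine; multi-class is handled one-vs-rest.
        LinearSVC(random_state=0),
    )


clf = build_classifier()
clf.fit(train_texts, train_labels)

# Scoring on the training texts only demonstrates the API; a real
# experiment would use held-out data or cross-validation.
predicted = clf.predict(train_texts)
train_f1 = f1_score(train_labels, predicted, average="macro")
print(f"macro F1 on training data: {train_f1:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted only on training data, which matters once cross-validation replaces the in-sample score used here.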