Architecture-independent matching of stripped binary code files using BERT and a Siamese neural network

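The proliferation of IoT devices brings many cyber security challenges. One of them is identifying executable code with known vulnerabilities, which remains difficult even though open source code is commonly used in IoT firmware. Contributing factors include the heavy use of heterogeneous architectures and of non-standard toolsets and compilers in IoT firmware development. To address this issue, this work examines the latest research in binary code matching. It concludes that the research does not adequately address the current cyber security issues raised by IoT devices, and it proposes a new method of binary code matching based on techniques common in Natural Language Processing (NLP). An artefact using Google’s BERT and a custom bi-directional LSTM Siamese network is developed and tested to demonstrate the viability of this new method. The BERT model was pre-trained on the code sections of binary executables compiled for the ARM architecture. It achieved scores of 89.1% and 98.0% in the key metrics masked_lm_accuracy and next_sentence_accuracy, respectively. This pre-trained BERT model was used to extract embeddings from the binary files’ code sections in order to train and validate the Siamese network. The Siamese network achieved an average rate of approximately 80% on the task of matching the stripped code sections of binary files compiled by two separate open source projects. This compares favorably to the 0% accuracy achieved by the fuzzy hashing algorithms SSDEEP and SDHASH.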
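
The Siamese matching named in the title can be illustrated with a minimal sketch. It is not the thesis implementation, which is not reproduced in this record. It assumes that each code section has already been turned into a sequence of embedding vectors (the thesis obtains these from a pre-trained BERT model), replaces the bi-directional LSTM with a simple mean-pooling encoder, and scores pairs by cosine similarity. The function names, the scoring function, and the 0.9 threshold are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def encode(embeddings: Sequence[Sequence[float]]) -> Vector:
    """Shared encoder applied to both inputs of the pair.

    Assumption: mean pooling stands in for the bi-directional LSTM
    used in the thesis, purely to keep the sketch short.
    """
    if not embeddings:
        raise ValueError("need at least one embedding vector")
    dim = len(embeddings[0])
    count = len(embeddings)
    return [sum(vec[i] for vec in embeddings) / count for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def siamese_score(left: Sequence[Sequence[float]],
                  right: Sequence[Sequence[float]]) -> float:
    """Encode both inputs with the same encoder, then compare the results."""
    return cosine_similarity(encode(left), encode(right))


def is_match(left: Sequence[Sequence[float]],
             right: Sequence[Sequence[float]],
             threshold: float = 0.9) -> bool:
    """Hypothetical decision rule: declare a match above a fixed threshold."""
    return siamese_score(left, right) >= threshold
```

The defining property of a Siamese setup is that both inputs pass through the same encoder with the same parameters, so the comparison happens in a single shared embedding space. A trained network would learn that encoder from labelled matching and non-matching pairs rather than using fixed pooling.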

Bibliographic Details
Main Author: Lampinen, Kenneth
Other Authors: Informaatioteknologian tiedekunta, Faculty of Information Technology, Informaatioteknologia, Information Technology, Jyväskylän yliopisto, University of Jyväskylä
Format: Master's thesis
Language: English
Published: 2020
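Subjects: binary file matching, deep learning, Natural Language Processing, NLP, BERT, transformer, LSTM, Siamese network, similarity detection, SSDEEP, SDHASH, cyber security, machine learning, Internet of things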
Online Access: https://jyx.jyu.fi/handle/123456789/73442
_version_ 1826225693407576064
author Lampinen, Kenneth
author2 Informaatioteknologian tiedekunta Faculty of Information Technology Informaatioteknologia Information Technology Jyväskylän yliopisto University of Jyväskylä
author_facet Lampinen, Kenneth Informaatioteknologian tiedekunta Faculty of Information Technology Informaatioteknologia Information Technology Jyväskylän yliopisto University of Jyväskylä
author_sort Lampinen, Kenneth
datasource_str_mv jyx
description The proliferation of IoT devices brings many cyber security challenges. One of them is identifying executable code with known vulnerabilities, which remains difficult even though open source code is commonly used in IoT firmware. Contributing factors include the heavy use of heterogeneous architectures and of non-standard toolsets and compilers in IoT firmware development. To address this issue, this work examines the latest research in binary code matching. It concludes that the research does not adequately address the current cyber security issues raised by IoT devices, and it proposes a new method of binary code matching based on techniques common in Natural Language Processing (NLP). An artefact using Google’s BERT and a custom bi-directional LSTM Siamese network is developed and tested to demonstrate the viability of this new method. The BERT model was pre-trained on the code sections of binary executables compiled for the ARM architecture. It achieved scores of 89.1% and 98.0% in the key metrics masked_lm_accuracy and next_sentence_accuracy, respectively. This pre-trained BERT model was used to extract embeddings from the binary files’ code sections in order to train and validate the Siamese network. The Siamese network achieved an average rate of approximately 80% on the task of matching the stripped code sections of binary files compiled by two separate open source projects. This compares favorably to the 0% accuracy achieved by the fuzzy hashing algorithms SSDEEP and SDHASH.
first_indexed 2024-09-11T08:50:02Z
format Pro gradu
fullrecord [{"key": "dc.contributor.advisor", "value": "Costin, Andrei", "language": "", "element": "contributor", "qualifier": "advisor", "schema": "dc"}, {"key": "dc.contributor.author", "value": "Lampinen, Kenneth", "language": "", "element": "contributor", "qualifier": "author", "schema": "dc"}, {"key": "dc.date.accessioned", "value": "2020-12-28T08:35:28Z", "language": null, "element": "date", "qualifier": "accessioned", "schema": "dc"}, {"key": "dc.date.available", "value": "2020-12-28T08:35:28Z", "language": null, "element": "date", "qualifier": "available", "schema": "dc"}, {"key": "dc.date.issued", "value": "2020", "language": "", "element": "date", "qualifier": "issued", "schema": "dc"}, {"key": "dc.identifier.uri", "value": "https://jyx.jyu.fi/handle/123456789/73442", "language": null, "element": "identifier", "qualifier": "uri", "schema": "dc"}, {"key": "dc.description.abstract", "value": "The proliferation of IoT devices brings many cyber security challenges. Identifying executable code with known vulnerabilities is one of them, this despite the fact that open source code is commonly used in IoT firmware. Factors that contribute to this challenge include the high usage of heterogeneous architectures, as well as non-standard toolsets and compilers when developing IoT firmware. To address this issue, this work examines the latest research in binary code matching. It concludes that the research does not adequately address the current cyber security issues incurred by IoT devices and proposes a new method of binary code matching based on techniques and methods commonly seen in Natural Language Processing (NLP). An artefact using Google\u2019s BERT and a custom bi-directional LSTM Siamese network is developed and tested to demonstrate the viability of this new method. The BERT model was pre-trained using the code sections of binary executables compiled for the ARM architecture. 
It achieved scores of 89.1% and 98.0% in the key metrics of masked_lm_accuracy and next_sentence_accuracy respectively. This pre-trained BERT model was used to extract embeddings from the binary files\u2019 code sections in order to train and validate the Siamese network. The Siamese network achieved an average rate of approximately 80% on the task of matching the stripped code sections of binary files compiled by two separate open source projects. This compares favorably to the 0% accuracy achieved by the fuzzy hashing algorithms SSDEEP and SDHASH.", "language": "en", "element": "description", "qualifier": "abstract", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Submitted by Paivi Vuorio (paelvuor@jyu.fi) on 2020-12-28T08:35:28Z\nNo. of bitstreams: 0", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Made available in DSpace on 2020-12-28T08:35:28Z (GMT). No. of bitstreams: 0\n Previous issue date: 2020", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.format.extent", "value": "99", "language": "", "element": "format", "qualifier": "extent", "schema": "dc"}, {"key": "dc.format.mimetype", "value": "application/pdf", "language": null, "element": "format", "qualifier": "mimetype", "schema": "dc"}, {"key": "dc.language.iso", "value": "eng", "language": null, "element": "language", "qualifier": "iso", "schema": "dc"}, {"key": "dc.rights", "value": "In Copyright", "language": "en", "element": "rights", "qualifier": null, "schema": "dc"}, {"key": "dc.subject.other", "value": "binary file matching", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "deep learning", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "Natural Language Processing", "language": "", "element": "subject", 
"qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "NLP", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "BERT", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "transformer", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "LSTM", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "Siamese network", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "similarity detection", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "SSDEEP", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "SDHASH", "language": "", "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.title", "value": "Architecture-independent matching of stripped binary code files using BERT and a Siamese neural network", "language": "", "element": "title", "qualifier": null, "schema": "dc"}, {"key": "dc.type", "value": "master thesis", "language": null, "element": "type", "qualifier": null, "schema": "dc"}, {"key": "dc.identifier.urn", "value": "URN:NBN:fi:jyu-202012287374", "language": "", "element": "identifier", "qualifier": "urn", "schema": "dc"}, {"key": "dc.type.ontasot", "value": "Pro gradu -tutkielma", "language": "fi", "element": "type", "qualifier": "ontasot", "schema": "dc"}, {"key": "dc.type.ontasot", "value": "Master\u2019s thesis", "language": "en", "element": "type", "qualifier": "ontasot", "schema": "dc"}, {"key": "dc.contributor.faculty", "value": "Informaatioteknologian tiedekunta", "language": "fi", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": "dc.contributor.faculty", 
"value": "Faculty of Information Technology", "language": "en", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": "dc.contributor.department", "value": "Informaatioteknologia", "language": "fi", "element": "contributor", "qualifier": "department", "schema": "dc"}, {"key": "dc.contributor.department", "value": "Information Technology", "language": "en", "element": "contributor", "qualifier": "department", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "Jyv\u00e4skyl\u00e4n yliopisto", "language": "fi", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "University of Jyv\u00e4skyl\u00e4", "language": "en", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Tietojenk\u00e4sittelytiede", "language": "fi", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Computer Science", "language": "en", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "yvv.contractresearch.funding", "value": "0", "language": "", "element": "contractresearch", "qualifier": "funding", "schema": "yvv"}, {"key": "dc.type.coar", "value": "http://purl.org/coar/resource_type/c_bdcc", "language": null, "element": "type", "qualifier": "coar", "schema": "dc"}, {"key": "dc.rights.accesslevel", "value": "restrictedAccess", "language": null, "element": "rights", "qualifier": "accesslevel", "schema": "dc"}, {"key": "dc.type.publication", "value": "masterThesis", "language": null, "element": "type", "qualifier": "publication", "schema": "dc"}, {"key": "dc.subject.oppiainekoodi", "value": "601", "language": "", "element": "subject", "qualifier": "oppiainekoodi", "schema": "dc"}, {"key": "dc.subject.yso", "value": "kyberturvallisuus", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": 
"koneoppiminen", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "esineiden internet", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "cyber security", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "machine learning", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "Internet of things", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.format.content", "value": "fulltext", "language": null, "element": "format", "qualifier": "content", "schema": "dc"}, {"key": "dc.rights.url", "value": "https://rightsstatements.org/page/InC/1.0/", "language": null, "element": "rights", "qualifier": "url", "schema": "dc"}, {"key": "dc.rights.accessrights", "value": "Tekij\u00e4 ei ole antanut lupaa avoimeen julkaisuun, joten aineisto on luettavissa vain Jyv\u00e4skyl\u00e4n yliopiston kirjaston <a href=\"https://kirjasto.jyu.fi/fi/tyoskentelytilat/laitteet-ja-tilat\">arkistoty\u00f6asemalta</a>.", "language": "fi", "element": "rights", "qualifier": "accessrights", "schema": "dc"}, {"key": "dc.rights.accessrights", "value": "<br><br>The author has not given permission to make the work publicly available electronically. Therefore the material can be read only at the archival <a href=\"https://kirjasto.jyu.fi/en/workspaces/facilities\">workstation</a> at Jyv\u00e4skyl\u00e4 University Library reserved for the use of archival materials.", "language": "en", "element": "rights", "qualifier": "accessrights", "schema": "dc"}, {"key": "dc.type.okm", "value": "G2", "language": null, "element": "type", "qualifier": "okm", "schema": "dc"}]
id jyx.123456789_73442
language eng
last_indexed 2025-02-18T10:56:23Z
main_date 2020-01-01T00:00:00Z
main_date_str 2020
publishDate 2020
record_format qdc
source_str_mv jyx
spellingShingle Lampinen, Kenneth Architecture-independent matching of stripped binary code files using BERT and a Siamese neural network binary file matching deep learning Natural Language Processing NLP BERT transformer LSTM Siamese network similarity detection SSDEEP SDHASH Tietojenkäsittelytiede Computer Science 601 kyberturvallisuus koneoppiminen esineiden internet cyber security machine learning Internet of things
title Architecture-independent matching of stripped binary code files using BERT and a Siamese neural network
title_full Architecture-independent matching of stripped binary code files using BERT and a Siamese neural network
title_fullStr Architecture-independent matching of stripped binary code files using BERT and a Siamese neural network
title_full_unstemmed Architecture-independent matching of stripped binary code files using BERT and a Siamese neural network
title_short Architecture-independent matching of stripped binary code files using BERT and a Siamese neural network
title_sort architecture independent matching of stripped binary code files using bert and a siamese neural network
title_txtP Architecture-independent matching of stripped binary code files using BERT and a Siamese neural network
topic binary file matching deep learning Natural Language Processing NLP BERT transformer LSTM Siamese network similarity detection SSDEEP SDHASH Tietojenkäsittelytiede Computer Science 601 kyberturvallisuus koneoppiminen esineiden internet cyber security machine learning Internet of things
topic_facet 601 BERT Computer Science Internet of things LSTM NLP Natural Language Processing SDHASH SSDEEP Siamese network Tietojenkäsittelytiede binary file matching cyber security deep learning esineiden internet koneoppiminen kyberturvallisuus machine learning similarity detection transformer
url https://jyx.jyu.fi/handle/123456789/73442 http://www.urn.fi/URN:NBN:fi:jyu-202012287374
work_keys_str_mv AT lampinenkenneth architectureindependentmatchingofstrippedbinarycodefilesusingbertandasiameseneur