Comparing the forecasting performance of logistic regression and random forest models in criminal recidivism

In recent years, the criminal sanctions field has developed models for predicting recidivism (Tyni, 2015). These are typically based on register-based indicators that measure, among other things, the convicted person's gender, age, criminal background, and number of prior prison terms. The development of such models usually relies on ...


Bibliographic Details
Main Author: Aaltonen, Olli-Pekka
Other Authors: Faculty of Information Technology, Informaatioteknologian tiedekunta, Tietojenkäsittelytieteiden laitos, Department of Computer Science and Information Systems, University of Jyväskylä, Jyväskylän yliopisto
Format: Master's thesis
Language: eng
Published: 2016
Subjects:
Online Access: https://jyx.jyu.fi/handle/123456789/51967
_version_ 1826225752049188864
author Aaltonen, Olli-Pekka
author2 Faculty of Information Technology, Informaatioteknologian tiedekunta, Tietojenkäsittelytieteiden laitos, Department of Computer Science and Information Systems, University of Jyväskylä, Jyväskylän yliopisto
datasource_str_mv jyx
description In recent years, models have been developed to predict the future criminal behavior (recidivism) of past offenders (e.g. Tyni, 2015). These models are typically built on register-based indicators such as the offender's gender, age, criminal background, and prior imprisonments. They are usually parametric models, such as logistic regression, in which the probability of recidivating is modelled through a linear function of the independent variables. Machine learning algorithms have lately been introduced as alternatives to these traditional models, and a recent American study reported that they predict recidivism more accurately than logistic regression (Berk & Bleich, 2014). However, machine learning algorithms have not been tested against traditional models for recidivism prediction on Finnish data. The aim of this thesis is to examine the comparative effectiveness of different risk prediction models in a Finnish setting. Two predictive models for recidivism are created: a logistic regression model and a model based on the Random forest machine learning algorithm. The research data was drawn from the RST database (Rikosten ja seuraamusten tutkimusrekisteri, "the research register of crimes and sanctions") of the Institute of Criminology and Legal Policy. It contains reference convictions from 2005 to 2007, covering all offenders convicted of several common types of offense in Finland during that period, together with information on each offender's criminal behavior before and after the reference conviction. The predictive models are developed with data from 2005 and 2006 and tested with the remaining 2007 data. This simulates a situation in which observed historical outcomes are used to predict the not yet realized recidivism of a new group of convicted offenders. The research question asks which of the two models performs better in forecasting the recidivism of a previous offender on the basis of criminal history data. The results show that both logistic regression and Random forest produce reasonably good predictive models in this setting, but neither outperforms the other on the chosen performance metrics. The answer to the research question is therefore that, with the design used in this thesis, no meaningful difference in predictive performance is found between the Random forest method and the logistic regression model for convicted offenders in Finland.
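The evaluation design described in the abstract above (fit on 2005 and 2006 convictions, test on 2007 convictions) can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's method: the records are synthetic, the field names `year`, `priors`, and `reoffended` are hypothetical, AUC is an assumed metric because the abstract does not name the metrics it used, and the two scorers are simple stand-ins rather than a logistic regression or a Random forest.

```python
import random


def temporal_split(records, train_years=(2005, 2006), test_year=2007):
    """Out-of-time holdout: earlier conviction years train, the last year tests."""
    train = [r for r in records if r["year"] in train_years]
    test = [r for r in records if r["year"] == test_year]
    return train, test


def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_rate_table(train):
    """Stand-in scorer A: observed reoffending rate per prior-conviction count."""
    totals, hits = {}, {}
    for r in train:
        k = r["priors"]
        totals[k] = totals.get(k, 0) + 1
        hits[k] = hits.get(k, 0) + r["reoffended"]
    base_rate = sum(hits.values()) / sum(totals.values())

    def score(r):
        k = r["priors"]
        # Fall back to the overall training rate for counts unseen in training.
        return hits[k] / totals[k] if k in totals else base_rate

    return score


def prior_count_score(r):
    """Stand-in scorer B: the raw number of prior convictions."""
    return r["priors"]


# Synthetic, hypothetical data: risk rises with the number of prior convictions.
rng = random.Random(0)
records = []
for _ in range(600):
    priors = rng.randint(0, 10)
    records.append({
        "year": rng.choice([2005, 2006, 2007]),
        "priors": priors,
        "reoffended": int(rng.random() < 0.1 + 0.07 * priors),
    })

train, test = temporal_split(records)
rate_score = fit_rate_table(train)  # fitted on training years only

labels = [r["reoffended"] for r in test]
auc_table = auc(labels, [rate_score(r) for r in test])
auc_count = auc(labels, [prior_count_score(r) for r in test])
print(f"test AUC, rate table:  {auc_table:.3f}")
print(f"test AUC, prior count: {auc_count:.3f}")
```

Splitting by year rather than at random keeps every test case later in time than the training cases, which is what lets the test year stand in for a new cohort whose outcomes are not yet known. In a real comparison the two stand-in scorers would be replaced by fitted models from a statistics or machine learning library.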
first_indexed 2024-09-11T08:51:53Z
format Pro gradu (Master's thesis)
fullrecord [{"key": "dc.contributor.advisor", "value": "Veijalainen, Jari", "language": "", "element": "contributor", "qualifier": "advisor", "schema": "dc"}, {"key": "dc.contributor.author", "value": "Aaltonen, Olli-Pekka", "language": null, "element": "contributor", "qualifier": "author", "schema": "dc"}, {"key": "dc.date.accessioned", "value": "2016-11-23T08:10:17Z", "language": "", "element": "date", "qualifier": "accessioned", "schema": "dc"}, {"key": "dc.date.available", "value": "2016-11-23T08:10:17Z", "language": "", "element": "date", "qualifier": "available", "schema": "dc"}, {"key": "dc.date.issued", "value": "2016", "language": null, "element": "date", "qualifier": "issued", "schema": "dc"}, {"key": "dc.identifier.other", "value": "oai:jykdok.linneanet.fi:1643461", "language": null, "element": "identifier", "qualifier": "other", "schema": "dc"}, {"key": "dc.identifier.uri", "value": "https://jyx.jyu.fi/handle/123456789/51967", "language": "", "element": "identifier", "qualifier": "uri", "schema": "dc"}, {"key": "dc.description.abstract", "value": "Rikosseuraamusalalla on viime vuosina kehitetty uusintarikollisuutta ennustavia malleja (Tyni, 2015), jotka perustuvat tyypillisesti rekisteripohjaisiin mittareihin, jotka mittaavat mm. tuomitun sukupuolta, ik\u00e4\u00e4, rikostaustaa ja vankikertaisuutta. Yleens\u00e4 t\u00e4llaisten mallien kehityksess\u00e4 k\u00e4ytet\u00e4\u00e4n logistisen regressioanalyysin kaltaisia parametrisia malleja, joissa uusintarikollisuuden todenn\u00e4k\u00f6isyytt\u00e4 mallinnetaan taustamuuttujien lineaarisena funktiona. N\u00e4iden mallien rinnalle on viime aikoina kehitetty koneoppimisalgoritmeihin perustuvia vaihtoehtoja, joiden on todettu suoriutuvan k\u00e4yt\u00e4nn\u00f6n sovelluksissa uusintarikollisuuden ennustamisessa perinteisi\u00e4 malleja paremmin (Berk & Bleich, 2014). T\u00e4llaisten mallien toimivuutta suhteessa perinteisiin malleihin ei ole kuitenkaan testattu suomalaisella datalla. 
Tutkielman tarkoituksena on tarkastella sit\u00e4, kuinka hyvin erilaiset ennustemallit onnistuvat teht\u00e4v\u00e4ss\u00e4\u00e4n. Tutkielman ensimm\u00e4isess\u00e4 vaiheessa luodaan logistiseen regressioanalyysiin ja koneoppimisalgoritmiin (Random forest) perustuvat uusintarikollisuutta ennustavat mallit Kriminologian ja oikeuspolitiikan instituutin Rikosten ja seuraamusten tutkimusrekisterist\u00e4 poimitulla aineistolla, joka sis\u00e4lt\u00e4\u00e4 referenssituomioita vuosilta 2005-2007. Tuomituille henkil\u00f6ille on haettu tietoa my\u00f6s referenssituomiota edelt\u00e4v\u00e4st\u00e4 ja seuraavasta rikosk\u00e4ytt\u00e4ytymisest\u00e4. Ennustemalli luodaan vuosien 2005\u20132006 v\u00e4lill\u00e4 tuomittujen aineistolla, ja ennustemallia testataan vuoden 2007 datalla. N\u00e4in simuloidaan tilannetta, jossa havaittuun aineistoon perustuvalla historiallisella toteumatiedolla ennustetaan uuden tuomittujen ryhm\u00e4n viel\u00e4 toteutumatonta uusintarikollisuutta. Tutkimuskysymyksen\u00e4 kysyt\u00e4\u00e4nkin, kumpi malleista pystyy luomaan rikoshistoriatiedon perusteella paremman ennustusmallin. Molemmat mallit ennustavat uusinta-rikollisuutta tutkielman asetelmassa verrattain hyvin. Kumpikaan ennustemalli ei kuitenkaan ole toista parempi, sill\u00e4 menetelm\u00e4t tuottavat ennustustehokkuudeltaan varsin samantasoiset mallit. Tutkielman tuloksena todetaan, ettei Random forest \u2013koneoppimismenetelm\u00e4n ja logistisen regressiomallin ennustustehokkuuden v\u00e4lille saada merkitt\u00e4v\u00e4\u00e4 eroa tutkielman asetelmalla.", "language": "fi", "element": "description", "qualifier": "abstract", "schema": "dc"}, {"key": "dc.description.abstract", "value": "During the recent years, predictive models have been created to predict the future criminal behavior (recidivism) of past offenders (e.g. Tyni, 2015). Predictive models are often created by using register-based indicators, e.g. 
offender\u2019s gender, age, criminal background, or prior imprisonments. Usually, these predictive models are created by using parametric models, where the likelihood of recidivating is modelled as a linear function of independent variables. Lately, machine learning algorithms have been introduced as alternatives to these more traditional models. In a recent American study, machine learning algorithms were stated to be more accurate predictors of recidivism than the more traditional logistic regression model (Berk & Bleich, 2014). However, these machine learning algorithms have not been tested for criminal recidivism prediction utilizing Finnish data. The aim of this thesis is to examine the comparative effectiveness of different risk prediction models in a Finnish setting. In this thesis, two predictive models for recidivism are created, one being a logistic regression model, and the other a machine learning algorithm-based model called Random forest. Research data was gathered from the RST (Rikosten ja seuraamusten tutkimusrekisteri, which translates to \u201cthe research register of crimes and sanctions\u201d) database of Institute of Criminology and Legal Policy, and includes all offenders convicted to several common crime type offenses in Finland from 2005 to 2007. Data also includes information on past and future criminal behavior for those offenders. Predictive models are developed with data from the years 2005 and 2006. The model testing is done with the remaining 2007 data, in order to simulate a situation where predictive models are used to predict recidivism yet to be actualized. The research question asks which of these models perform better in forecasting the criminal recidivism of a previous offender. The results of this study show that both logistic regression and Random forest algorithm create decent predictive models, but neither model outperforms the other on chosen performance metrics. 
The outcome, and the answer to the research question is, that neither model is better than the other in predicting recidivism among convicted offenders in Finland.", "language": "en", "element": "description", "qualifier": "abstract", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Submitted using Plone Publishing form by Olli-Pekka Aaltonen (olaalton) on 2016-11-23 08:10:17.062676. Form: Pro gradu -lomake (https://kirjasto.jyu.fi/julkaisut/julkaisulomakkeet/pro-gradu-lomake). JyX data: [jyx_publishing-allowed (fi) =False]", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Submitted by jyx lomake-julkaisija (jyx-julkaisija.group@korppi.jyu.fi) on 2016-11-23T08:10:17Z\r\nNo. of bitstreams: 2\r\nURN:NBN:fi:jyu-201611234724.pdf: 912639 bytes, checksum: 8de7cf2a706d8694d423ee1027ee65eb (MD5)\r\nlicense.html: 1183 bytes, checksum: 9f694debb5b3d5bee5941d30880e4965 (MD5)", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.description.provenance", "value": "Made available in DSpace on 2016-11-23T08:10:17Z (GMT). No. 
of bitstreams: 2\r\nURN:NBN:fi:jyu-201611234724.pdf: 912639 bytes, checksum: 8de7cf2a706d8694d423ee1027ee65eb (MD5)\r\nlicense.html: 1183 bytes, checksum: 9f694debb5b3d5bee5941d30880e4965 (MD5)\r\n Previous issue date: 2016", "language": "en", "element": "description", "qualifier": "provenance", "schema": "dc"}, {"key": "dc.format.extent", "value": "1 verkkoaineisto (53 sivua)", "language": null, "element": "format", "qualifier": "extent", "schema": "dc"}, {"key": "dc.format.mimetype", "value": "application/pdf", "language": null, "element": "format", "qualifier": "mimetype", "schema": "dc"}, {"key": "dc.language.iso", "value": "eng", "language": null, "element": "language", "qualifier": "iso", "schema": "dc"}, {"key": "dc.rights", "value": "In Copyright", "language": "en", "element": "rights", "qualifier": null, "schema": "dc"}, {"key": "dc.subject.other", "value": "Recidivism", "language": null, "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "machine learning", "language": null, "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "Random forest", "language": null, "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "logistic regression", "language": null, "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.subject.other", "value": "forecasting", "language": null, "element": "subject", "qualifier": "other", "schema": "dc"}, {"key": "dc.title", "value": "Comparing the forecasting performance of logistic regression and random forest models in criminal recidivism", "language": null, "element": "title", "qualifier": null, "schema": "dc"}, {"key": "dc.type", "value": "master thesis", "language": null, "element": "type", "qualifier": null, "schema": "dc"}, {"key": "dc.identifier.urn", "value": "URN:NBN:fi:jyu-201611234724", "language": null, "element": "identifier", "qualifier": "urn", "schema": "dc"}, 
{"key": "dc.type.ontasot", "value": "Master\u2019s thesis", "language": "en", "element": "type", "qualifier": "ontasot", "schema": "dc"}, {"key": "dc.type.ontasot", "value": "Pro gradu -tutkielma", "language": "fi", "element": "type", "qualifier": "ontasot", "schema": "dc"}, {"key": "dc.contributor.faculty", "value": "Faculty of Information Technology", "language": "en", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": "dc.contributor.faculty", "value": "Informaatioteknologian tiedekunta", "language": "fi", "element": "contributor", "qualifier": "faculty", "schema": "dc"}, {"key": "dc.contributor.department", "value": "Tietojenk\u00e4sittelytieteiden laitos", "language": "fi", "element": "contributor", "qualifier": "department", "schema": "dc"}, {"key": "dc.contributor.department", "value": "Department of Computer Science and Information Systems", "language": "en", "element": "contributor", "qualifier": "department", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "University of Jyv\u00e4skyl\u00e4", "language": "en", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.contributor.organization", "value": "Jyv\u00e4skyl\u00e4n yliopisto", "language": "fi", "element": "contributor", "qualifier": "organization", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Information Systems Science", "language": "en", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "dc.subject.discipline", "value": "Tietoj\u00e4rjestelm\u00e4tiede", "language": "fi", "element": "subject", "qualifier": "discipline", "schema": "dc"}, {"key": "dc.date.updated", "value": "2016-11-23T08:10:18Z", "language": "", "element": "date", "qualifier": "updated", "schema": "dc"}, {"key": "yvv.contractresearch.funding", "value": "0", "language": "", "element": "contractresearch", "qualifier": "funding", "schema": "yvv"}, {"key": "dc.type.coar", "value": "http://purl.org/coar/resource_type/c_bdcc", 
"language": null, "element": "type", "qualifier": "coar", "schema": "dc"}, {"key": "dc.rights.accesslevel", "value": "restrictedAccess", "language": "fi", "element": "rights", "qualifier": "accesslevel", "schema": "dc"}, {"key": "dc.type.publication", "value": "masterThesis", "language": null, "element": "type", "qualifier": "publication", "schema": "dc"}, {"key": "dc.subject.oppiainekoodi", "value": "601", "language": null, "element": "subject", "qualifier": "oppiainekoodi", "schema": "dc"}, {"key": "dc.subject.yso", "value": "uusintarikollisuus", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "ennusteet", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "regressioanalyysi", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.subject.yso", "value": "koneoppiminen", "language": null, "element": "subject", "qualifier": "yso", "schema": "dc"}, {"key": "dc.format.content", "value": "fulltext", "language": null, "element": "format", "qualifier": "content", "schema": "dc"}, {"key": "dc.rights.url", "value": "https://rightsstatements.org/page/InC/1.0/", "language": null, "element": "rights", "qualifier": "url", "schema": "dc"}, {"key": "dc.rights.accessrights", "value": "This material has a restricted access due to copyright reasons. It can be read at the workstation at Jyv\u00e4skyl\u00e4 University Library reserved for the use of archival materials: https://kirjasto.jyu.fi/en/workspaces/facilities.", "language": "en", "element": "rights", "qualifier": "accessrights", "schema": "dc"}, {"key": "dc.rights.accessrights", "value": "Aineistoon p\u00e4\u00e4sy\u00e4 on rajoitettu tekij\u00e4noikeussyist\u00e4. Aineisto on luettavissa Jyv\u00e4skyl\u00e4n yliopiston kirjaston arkistoty\u00f6asemalta. Ks. 
https://kirjasto.jyu.fi/fi/tyoskentelytilat/laitteet-ja-tilat.", "language": "fi", "element": "rights", "qualifier": "accessrights", "schema": "dc"}, {"key": "dc.type.okm", "value": "G2", "language": null, "element": "type", "qualifier": "okm", "schema": "dc"}]
id jyx.123456789_51967
language eng
last_indexed 2025-02-18T10:56:42Z
main_date 2016-01-01T00:00:00Z
main_date_str 2016
publishDate 2016
record_format qdc
source_str_mv jyx
title Comparing the forecasting performance of logistic regression and random forest models in criminal recidivism
topic Recidivism, machine learning, Random forest, logistic regression, forecasting, Information Systems Science, Tietojärjestelmätiede, 601, uusintarikollisuus, ennusteet, regressioanalyysi, koneoppiminen
url https://jyx.jyu.fi/handle/123456789/51967 http://www.urn.fi/URN:NBN:fi:jyu-201611234724