| Name | Description | Size | Format |
|---|---|---|---|
| | | 1.27 MB | Adobe PDF |
Abstract
Security in web applications is often compromised by poorly written code that attackers exploit. Source code vulnerability detection tools have been developed using static analysis and machine learning techniques. The best-performing tools aim for very low false negative rates along with acceptable false positive rates. Static analysis requires manual programming to identify vulnerabilities, depends on human expertise, and is usually limited to a specific programming language. On the other hand, the classical supervised machine learning approaches used previously may be limited in their ability to identify zero-day vulnerabilities, or prone to overfitting because the available datasets are limited.
This dissertation aims to develop a hybrid machine learning (ML) system for detecting vulnerabilities in web applications. The system will use a combination of static analysis and Natural Language Processing (NLP) techniques to identify functions related to vulnerabilities, which will be used to build representative datasets. The datasets will be used as input for unsupervised machine learning and other behaviour-based anomaly detection algorithms in order to flag the code snippets under analysis as suspicious. For the flagged snippets, the system will aim to confirm which are vulnerable and to identify the type of vulnerability via supervised machine learning techniques.
The dissertation explores a novel approach to vulnerability detection by combining unsupervised anomaly detection models with supervised machine learning and Natural Language Processing techniques. Previous research in vulnerability detection has primarily focused on either unsupervised or supervised methods, neglecting the potential benefits of a hybrid approach. The goal of this research is to investigate the efficacy of hybrid architectures in identifying software vulnerabilities and to determine the optimal machine learning models and datasets for this purpose. The proposed hybrid model consists of three layers. The first uses a One-Class Support Vector Machine (OCSVM) model to detect anomalies. The second employs a Random Forest model to confirm the presence of vulnerabilities in the detected anomalies. The third classifies the type of vulnerability with a Logistic Regression model that relies on the Doc2Vec model for feature extraction.
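The three-layer flow described above can be illustrated with a minimal scikit-learn sketch. It is not the thesis implementation: the class name `HybridDetector`, the toy PHP-like snippets, the labels `sqli` and `xss`, and all hyperparameters are hypothetical, and a hashed bag of identifiers (`HashingVectorizer`) stands in for Doc2Vec so the example needs only scikit-learn. Training the Random Forest on all labelled snippets, rather than only on flagged anomalies, is also an assumption.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM


class HybridDetector:
    """Hypothetical three-layer sketch: OCSVM -> Random Forest -> Logistic Regression."""

    def __init__(self):
        # Stand-in for Doc2Vec (assumption): a stateless hashed bag of identifiers.
        self.vectorizer = HashingVectorizer(
            n_features=256, alternate_sign=False,
            token_pattern=r"[A-Za-z_$][A-Za-z_0-9]*")
        self.ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.2)
        self.forest = RandomForestClassifier(n_estimators=50, random_state=0)
        self.type_clf = LogisticRegression(max_iter=1000)

    def fit(self, safe, vulnerable, vuln_types):
        # Layer 1 learns what "normal" code looks like from safe snippets only.
        self.ocsvm.fit(self.vectorizer.transform(safe))
        # Layer 2 learns to separate vulnerable (1) from safe (0) snippets.
        X_all = self.vectorizer.transform(list(safe) + list(vulnerable))
        y_all = [0] * len(safe) + [1] * len(vulnerable)
        self.forest.fit(X_all, y_all)
        # Layer 3 learns the vulnerability type from vulnerable snippets only.
        self.type_clf.fit(self.vectorizer.transform(vulnerable), vuln_types)
        return self

    def predict(self, snippets):
        X = self.vectorizer.transform(snippets)
        is_anomaly = self.ocsvm.predict(X) == -1  # -1 marks an outlier
        labels = []
        for i, anomalous in enumerate(is_anomaly):
            if not anomalous:
                labels.append("not flagged")
            elif self.forest.predict(X[i])[0] == 0:
                labels.append("flagged, not confirmed")
            else:
                labels.append(str(self.type_clf.predict(X[i])[0]))
        return labels


# Made-up toy data, for illustration only.
safe = [
    '$name = htmlspecialchars($_GET["name"]); echo $name;',
    '$stmt = $pdo->prepare("SELECT * FROM users WHERE id = ?");',
    '$total = count($items) + 1;',
    'function add($a, $b) { return $a + $b; }',
    '$clean = filter_var($email, FILTER_VALIDATE_EMAIL);',
    '$stmt->execute([$id]); $row = $stmt->fetch();',
]
vulnerable = [
    'mysql_query("SELECT * FROM users WHERE id=" . $_GET["id"]);',
    'mysqli_query($db, "DELETE FROM t WHERE n=" . $_POST["n"]);',
    'echo $_GET["name"];',
    'print "<p>" . $_POST["comment"] . "</p>";',
]
vuln_types = ["sqli", "sqli", "xss", "xss"]

detector = HybridDetector().fit(safe, vulnerable, vuln_types)
results = detector.predict([
    'echo $_GET["q"];',
    '$sum = count($rows) + 1;',
])
print(results)
```

Each snippet ends with one of four outcomes: not flagged by the anomaly detector, flagged but not confirmed by the Random Forest, or confirmed and assigned one of the two toy vulnerability types. On data this small the actual predictions carry no meaning.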
The research includes experimentation with various machine learning models and datasets, evaluating features that range from simple binary features to more complex Doc2Vec embeddings. The thesis demonstrates OCSVM's suitability for semi-unsupervised anomaly detection, yielding promising results across various datasets. Additionally, the study assesses the effectiveness of Random Forests in classifying vulnerable source code snippets based on OCSVM-detected anomalies, and validates the use of NLP techniques for feature extraction from source code snippets. Overall, the proposed hybrid model achieved an accuracy of 65%. Although this result seems low, the research offers a promising hybrid approach to vulnerability detection, leveraging the strengths of unsupervised and supervised machine learning models. The findings suggest opportunities for further enhancements and optimizations, paving the way for more effective software vulnerability detection systems.
Description
Master's thesis, Data Science, 2023, Universidade de Lisboa, Faculdade de Ciências
Keywords
web vulnerability detection; machine learning; anomaly detection; natural language processing; software security; Master's theses - 2024
