A Hybrid Machine Learning System for Vulnerability Detection in Web Applications

Oliveira, Miguel César de Albuquerque

http://hdl.handle.net/10451/63629

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
TM_Miguel_Oliveira.pdf		1.27 MB	Adobe PDF

Authors

Oliveira, Miguel César de Albuquerque

Advisor(s)

Medeiros, Ibéria Vitória de Sousa, 1971-

Abstract(s)

Security in web applications is often compromised by poorly written code that attackers exploit. Source code vulnerability detection tools have been developed using static analysis and machine learning techniques. The best-performing tools aim for very low false negative rates along with acceptable false positive rates. Static analysis requires manual programming to identify vulnerabilities, depends on human expertise, and is usually limited to a specific programming language. On the other hand, the classical supervised machine learning approaches used previously may be limited in their ability to identify zero-day vulnerabilities, or prone to overfitting because of the limited datasets available. This dissertation aims to develop a hybrid machine learning (ML) system for vulnerability detection in web applications. The system developed will use a combination of static analysis and Natural Language Processing (NLP) techniques to identify vulnerability-related functions, which will be used to build representative datasets. The datasets will be used as input for unsupervised machine learning and other behaviour-based anomaly detection algorithms in order to flag the code snippets under analysis as suspicious. For these flagged snippets, the system will aim to confirm which are vulnerable and to identify the type of vulnerability via supervised machine learning techniques. The dissertation explores a novel approach to vulnerability detection by combining unsupervised anomaly detection models with supervised machine learning and Natural Language Processing techniques. Previous research in vulnerability detection has primarily focused on either unsupervised or supervised methods, neglecting the potential benefits of a hybrid approach. The goal of this research is to investigate the efficacy of hybrid architectures in identifying software vulnerabilities and to determine the optimal machine learning models and datasets for this purpose. The proposed hybrid model consists of several layers.
The first uses a One-Class Support Vector Machine (OCSVM) model to detect anomalies; the second employs a Random Forest model to confirm the presence of vulnerabilities in those anomalies. The type of vulnerability is then classified by a Logistic Regression model that relies on the Doc2Vec model for feature extraction. The research includes experimentation with various machine learning models and datasets, evaluating features that range from simple binary features to more complex Doc2Vec embeddings. The thesis demonstrates OCSVM's suitability for semi-unsupervised anomaly detection, yielding promising results across various datasets. Additionally, the study assesses the effectiveness of Random Forests in classifying vulnerable source code snippets based on OCSVM-detected anomalies and validates the use of NLP techniques for feature extraction from source code snippets. Overall, the proposed hybrid model achieved an accuracy of 65%. Although this result seems low, the research offers a promising hybrid approach to vulnerability detection, leveraging the strengths of unsupervised and supervised machine learning models. The findings suggest opportunities for further enhancements and optimizations, paving the way for more effective software vulnerability detection systems.
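
The layered design described in the abstract can be illustrated with a minimal scikit-learn sketch. This is not the author's implementation: the toy PHP-like snippets, the labels, the hyperparameters, and the names `build_detector` and `analyse` are all hypothetical, and a `HashingVectorizer` stands in for the Doc2Vec embeddings so that the example depends only on scikit-learn.

```python
"""Sketch of a three-layer hybrid detector. Data and parameters are illustrative assumptions."""
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

# Hypothetical toy corpus (not taken from the thesis).
BENIGN = [
    "$name = htmlspecialchars($_GET['name']); echo $name;",
    "$stmt = $pdo->prepare('SELECT * FROM users WHERE id = ?'); $stmt->execute([$id]);",
    "$page = intval($_GET['page']); echo $page;",
    "$title = htmlentities($row['title']); echo $title;",
]
VULNERABLE = [
    ("mysqli_query($c, 'SELECT * FROM users WHERE id = ' . $_GET['id']);", "sqli"),
    ("mysqli_query($c, 'DELETE FROM posts WHERE id = ' . $_POST['id']);", "sqli"),
    ("echo $_GET['name'];", "xss"),
    ("print $_POST['comment'];", "xss"),
]

NOT_SUSPICIOUS = "not suspicious"
NOT_CONFIRMED = "anomalous, not confirmed"
OUTCOMES = {NOT_SUSPICIOUS, NOT_CONFIRMED, "sqli", "xss"}


def build_detector():
    """Train the three layers and return a function that analyses one snippet."""
    # Stand-in for Doc2Vec: a stateless bag-of-tokens hashing vectorizer.
    vectorizer = HashingVectorizer(n_features=256, alternate_sign=False)
    vuln_snippets = [snippet for snippet, _ in VULNERABLE]
    vuln_types = [vuln_type for _, vuln_type in VULNERABLE]

    # Layer 1: the one-class SVM learns what benign code looks like.
    ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
    ocsvm.fit(vectorizer.transform(BENIGN))

    # Layer 2: the random forest separates vulnerable (1) from benign (0) snippets.
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    forest.fit(
        vectorizer.transform(BENIGN + vuln_snippets),
        [0] * len(BENIGN) + [1] * len(vuln_snippets),
    )

    # Layer 3: logistic regression assigns a vulnerability type.
    type_model = LogisticRegression(max_iter=1000)
    type_model.fit(vectorizer.transform(vuln_snippets), vuln_types)

    def analyse(snippet):
        features = vectorizer.transform([snippet])
        if ocsvm.predict(features)[0] == 1:  # +1 means inlier, i.e. no anomaly
            return NOT_SUSPICIOUS
        if forest.predict(features)[0] == 0:  # anomaly not confirmed as vulnerable
            return NOT_CONFIRMED
        return str(type_model.predict(features)[0])

    return analyse


analyse = build_detector()
print(analyse("echo $_GET['q'];"))
```

Each layer only sees what the previous one lets through, so a snippet that the first layer accepts as normal never reaches the classifiers. With a corpus this small the predictions themselves carry no meaning; the sketch shows only the control flow between the layers.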

Description

Master's thesis, Data Science, 2023, Universidade de Lisboa, Faculdade de Ciências

Keywords

web vulnerability detection; machine learning; anomaly detection; natural language processing; software security; Master's theses - 2024

URI

http://hdl.handle.net/10451/63629

Collections

FC-DI - Master Thesis (dissertation)
