Processing Web Applications using NLP for vulnerability detection and explanation

Guerreiro, Jorge Manuel Gomes

http://hdl.handle.net/10400.5/116673

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
TM_Jorge_Guerreiro.pdf		1.4 MB	Adobe PDF

Authors

Guerreiro, Jorge Manuel Gomes

Advisor(s)

Medeiros, Ibéria Vitória de Sousa

Abstract(s)

Web applications increasingly store and process valuable user data, which makes flaws in their source code a direct threat to users' privacy and security. Attackers commonly exploit vulnerabilities such as SQL Injection (SQLi) and Cross-Site Scripting (XSS), often taking advantage of missing sanitization or validation of user inputs. Existing static analysis and machine-learning tools can flag potential faults, but rule-based methods struggle with evolving attack complexity and, crucially, provide explanations that are insufficient for developers to remediate the issues. This dissertation proposes a novel Natural Language Processing (NLP)-based methodology for detecting and explaining vulnerabilities in PHP web applications. We introduce an Intermediate Language (IL) produced by a pre-processing and translation tool, PHP2IL, which normalises PHP source code while preserving program logic and data flow. Pre-processing removes biasing elements (comments, literal strings) and obfuscates identifiers, enabling NLP models to learn structural and semantic patterns rather than surface artefacts. Two complementary families of models are employed: neural models (LSTM, Transformers) perform code snippet classification (vulnerable or not vulnerable), while sequential supervised models (HMM, MEMM) are trained token-by-token on annotated IL to trace tainted data propagation and explain where sanitization is missing. Before the explanation step, a heuristic aggregates the classification models' outputs, weighting predictions by model accuracy, and forwards the code under analysis to the explanatory models that correspond to the final class. The approach was implemented in the VulNLan tool, and we evaluated it with the NIST SARD dataset focused on SQLi in PHP. Neural classifiers achieved approximately 96% precision and, when combined in the VulNLan tool, the ensemble reached an accuracy of 0.960 with an F1-score of 0.962. The sequential explanation models successfully identified taint propagation and sanitization points, producing interpretable outputs. These results show that applying NLP to an IL representation can both detect and explain vulnerabilities in PHP source, closing the gap between automated detection and actionable remediation, helping developers fix code and protecting end users.
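
The abstract says classifier outputs are aggregated by weighting each prediction by model accuracy, without giving the exact rule. The following is a minimal sketch of one plausible reading (an accuracy-weighted vote), not the VulNLan implementation. The function name `weighted_vote`, the model keys, the label strings and the accuracy values are all illustrative assumptions.

```python
from collections import defaultdict


def weighted_vote(predictions, accuracies):
    """Return the label with the largest accuracy-weighted support.

    predictions: model name -> predicted label (hypothetical names)
    accuracies:  model name -> accuracy in [0, 1] (hypothetical values)
    """
    scores = defaultdict(float)
    for model, label in predictions.items():
        # Each model contributes its accuracy as the weight of its vote.
        scores[label] += accuracies[model]
    # Iterating over sorted labels makes tie-breaking deterministic.
    return max(sorted(scores), key=lambda label: scores[label])


# Illustrative accuracies only; these are not figures from the thesis.
accuracies = {"lstm": 0.95, "transformer": 0.96}

# The two classifiers disagree, so the more accurate one decides.
final_label = weighted_vote(
    {"lstm": "vulnerable", "transformer": "not_vulnerable"},
    accuracies,
)
print(final_label)
```

In this sketch the returned label would then select which explanatory models receive the snippet, mirroring the forwarding step the abstract describes.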

Description

Master's thesis, Informatics, 2025, Universidade de Lisboa, Faculdade de Ciências

Keywords

Vulnerabilities; Web applications; NLP models; Explainability; Software security

URI

http://hdl.handle.net/10400.5/116673

Collections

Pure > Dspace
PURE > Dspace - Faculdade de Ciências
