LASIGE - Extreme Computing

Publications

Development of a Website for Creation of Vulnerability Datasets

Publication . Ferreira, Miguel Pinto da Silva; Neves, Nuno Fuentecilla Maia Ferreira; Medeiros, Ibéria Vitória de Sousa

With the evolution of the digital era, guaranteeing the robustness and security of software has become a major concern. To address it, it is important not only to detect software vulnerabilities effectively, but also to mitigate them. Static Analysis Tools (SATs) offer a cost-effective solution, providing cheap and fast analysis, but they often produce a high percentage of false positives and false negatives. Recent studies suggest that machine learning (ML) techniques could enhance the effectiveness of these tools, but this requires trustworthy and reliable datasets to train the ML models. This dissertation aims to provide a way to create such datasets, which can support the development of ML models capable of identifying vulnerabilities in computer programs. To achieve this, we propose a novel approach to constructing these datasets, which consists of collecting inputs from the crowd to mitigate the false positives and negatives generated by the SATs while still leveraging their deterministic classifications. This approach is applied to web vulnerabilities that appear in applications built with the PHP programming language. To facilitate crowdsourcing, we developed a user-friendly website called BugSpotting, where users can classify PHP code snippets, indicating whether or not they are vulnerable to a set of vulnerability classes. With the results obtained from both the crowd and the SATs, we are able to obtain a reliable and trustworthy dataset composed of accurately classified PHP code snippets. We evaluated BugSpotting in terms of UI and UX, and the results were very satisfactory. Moreover, although we were not able to reach a consensus on the code snippets' final labels, we were still able to analyse the data collected so far, which shows promising results.
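The abstract does not say how crowd answers and SAT classifications are combined into a final label. As an illustration only, the sketch below assumes a simple majority vote with a minimum agreement threshold. The function name `aggregate_label`, the label strings and the `min_agreement` parameter are hypothetical and are not taken from BugSpotting.

```python
from collections import Counter


def aggregate_label(crowd_votes, sat_verdicts, min_agreement=0.7):
    """Combine crowd votes and SAT verdicts for one code snippet.

    Returns the majority label when its share of all votes reaches
    min_agreement, or None when no consensus is reached.
    All names and the threshold are illustrative assumptions.
    """
    votes = list(crowd_votes) + list(sat_verdicts)
    if not votes:
        return None
    # most_common(1) yields the label with the highest count.
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


# Three crowd votes plus two SAT verdicts for a single PHP snippet.
consensus = aggregate_label(
    ["vulnerable", "vulnerable", "vulnerable"],
    ["vulnerable", "not_vulnerable"],
)
# An evenly split snippet yields no final label.
split = aggregate_label(
    ["vulnerable", "not_vulnerable"],
    ["vulnerable", "not_vulnerable"],
)
```

Returning `None` for low-agreement snippets keeps them out of the dataset until more votes arrive, which is one possible way to handle snippets whose final label has not yet reached consensus.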

2024 Master thesis

Open access

Automatic binary patching for flaws repairing using static rewriting and reverse dataflow analysis

Publication . Ferreira, Diogo Tomás; Medeiros, Ibéria Vitória de Sousa

The C programming language is widely used in embedded systems, kernel and hardware programming, making it one of the most commonly used programming languages. However, C lacks boundary verification of variables, making it one of the most vulnerable languages. Because of this, and given its widespread use, it is also the language with the most reported vulnerabilities in the past ten years, with memory corruption, specifically buffer overflows, being the most common type of vulnerability. When exploited, these vulnerabilities can have critical consequences, so it is extremely important not only to identify them correctly but also to fix them properly. This work aims to study buffer overflow vulnerabilities in C binary programs by identifying possible malicious inputs that can trigger such vulnerabilities and finding their root cause, in order to mitigate the vulnerabilities by rewriting the binary assembly code and thus generating a new binary without the original flaw. The main focus of this thesis is the use of binary patching to automatically fix stack overflow vulnerabilities and to validate the effectiveness of the patches while ensuring that they do not add new vulnerabilities. Working with the binary code of applications without access to their source code is a challenge because any required change to the binary code (i.e., assembly) must take into account that new instructions have to be allocated. This typically means that existing instructions need to be moved to make room for the new ones and that control flow information must be recovered; otherwise the application would be compromised. The approach we propose to address this problem was successfully implemented in a tool and evaluated with a set of test cases and real applications. The evaluation results showed that the tool was effective in finding vulnerabilities, as well as in patching them.

2023 Master thesis

Open access

A benchmark for biomedical knowledge graph based similarity

Publication . Cardoso, Carlota Maria Alegre Branco Ferreira; Pesquita, Cátia, 1980-

Biomedical knowledge graphs are crucial for supporting data-intensive applications in the life and health sciences. One of the most common applications of knowledge graphs in the life sciences is supporting the comparison of entities in the graph through their ontological descriptions. These descriptions support the computation of semantic similarity between two entities, and finding their similarities and differences is a fundamental technique for a variety of applications, from predicting protein-protein interactions to discovering associations between diseases and genes and predicting the cellular localization of proteins, among others. Over the last decade, there has been considerable effort in developing semantic similarity measures for biomedical knowledge graphs but, so far, research in this area has focused on comparing relatively small sets of entities. Given the diverse range of applications for semantic similarity measures, it is essential to support their large-scale evaluation. However, doing so is not trivial, since there is no gold standard for the similarity of biological entities. One possible solution is to compare these measures with other similarity measures or proxies. Biological entities can be compared from different angles, for example, the sequence and structural similarity of two proteins or the metabolic pathways affected by two diseases. These measures relate to relevant characteristics of the entities, so they can help to understand how semantic similarity approaches capture entity similarity. The goal of this work is to develop a benchmark composed of data sets and automated evaluation methods. This benchmark should support the large-scale evaluation of semantic similarity measures for biological entities, based on their correlation with different properties of the entities.
To achieve this goal, a methodology for developing reference data sets for semantic similarity was developed and applied to two knowledge graphs: proteins annotated with the Gene Ontology and genes annotated with the Human Phenotype Ontology. This benchmark explores similarity proxies based on sequence similarity, molecular function and protein interactions, and phenotype-based gene similarity, and provides semantic similarity computations with representative state-of-the-art measures for comparative evaluation. This resulted in a benchmark composed of a collection of 21 reference data sets of varying sizes, covering four species and different levels of entity annotation, and evaluation techniques tailored to the data sets.
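The evaluation idea in this abstract, scoring a semantic similarity measure by how well it correlates with a proxy such as sequence similarity, can be sketched as follows. This is a minimal illustration that assumes Pearson correlation as the statistic. The abstract does not name the statistic used, and the function name and sample values below are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for the same four protein pairs:
# one list from a semantic similarity measure, one from a sequence-based proxy.
semantic_sim = [0.10, 0.35, 0.60, 0.90]
sequence_sim = [0.05, 0.40, 0.55, 0.95]
score = pearson(semantic_sim, sequence_sim)
```

A higher `score` would suggest that the semantic measure ranks entity pairs similarly to the proxy. A rank-based statistic such as Spearman correlation could be substituted if only the ordering of pairs matters.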

2020 Master thesis

Open access

Recommender system to support comprehensive exploration of large scale scientific datasets

Publication . Barros, Márcia; Couto, Francisco José Moreira; Almeida, André Moitinho de

Databases of scientific entities, such as chemical compounds, diseases and astronomical objects, have grown in size and complexity, reaching billions of items per database. Researchers need new and innovative tools to help them choose among these items. This work proposes the use of Recommender Systems to help researchers find items of interest. We identify the lack of standardized, open-access datasets with information about user preferences as one of the major challenges for applying recommender systems in scientific fields. To overcome this challenge, we developed a methodology called LIBRETTI (Literature Based Recommendation of Scientific Items), whose goal is the creation of datasets related to scientific fields. These datasets are created from the main knowledge resource that Science has: the scientific literature. The LIBRETTI methodology enabled the development of new recommendation algorithms specific to several scientific fields. Besides LIBRETTI, the main contributions of this thesis are standardized recommendation datasets in the areas of Astronomy, Chemistry and Health (related to the COVID-19 disease), a hybrid semantic recommender system for chemical compounds in large-scale datasets, a hybrid approach based on sequential enrichment (SeEn) for sequential recommendations, and a multi-field semantic-based pipeline for recommending biomedical entities related to the COVID-19 disease.

2022-01 Doctoral thesis

Open access

Modelling early visual processes of illiterate populations with Deep Belief Networks

Publication . Fottner, Nicola Alessandro; Fernandes, Tânia Patrícia Gregório; Correia, Luís Miguel Parreira e

The Neuronal Recycling Hypothesis (Dehaene, 2005; Dehaene & Cohen, 2007) proposes that the efficient computation and representation of written words at the orthographic stage of processing is enabled through the adaptation of pre-existing visual functions, which, in turn, leads to the emergence of a specialised reading system. The present thesis aimed to investigate the emergence of neural detectors tuned to letters through biologically plausible computational models. A Deep Belief Network (DBN) was implemented as a model of visual shape perception, inspired by Testolin et al. (2017), and used to answer two questions: 1) does the DBN model generalise shape information learned from images of geometrical shapes to the classification of letters and pseudoletters (i.e., nonletters sharing the same features as letters), for example, classifying A as a triangle?; 2) is visual shape processing by a DBN sensitive to the same integration processes as those reflected in crowding effects (i.e., integration of adjacent information) in human observers, namely the congruency effect (better performance for targets surrounded by congruent than by incongruent shapes)? The results showed that classification of letters and pseudoletters by our DBN was nonuniform across the different tested letter fonts, thus suggesting that decisions were not led by global shape. Interestingly, our model exhibited a congruency effect and, hence, a perceptual strategy similar to that previously found in illiterate adults (Fernandes et al., 2014). These results and further analyses also showed that our model’s perceptual strategy was not driven by low-level pixel similarities. The present work sets the stage to further emulate the transition from the illiterate to the ex-illiterate state, as done in the work of Hannagan et al. (2021) but with biologically more plausible learning algorithms (Bengio et al., 2015; Hinton & Salakhutdinov, 2006).

2023 Master thesis

Open access