| Name | Description | Size | Format |
|---|---|---|---|
| | | 3.75 MB | Adobe PDF |
Authors
Advisor(s)
Abstract(s)
The vastness of chemical space poses a daunting challenge in drug discovery, particularly when predicting drug-target interactions (DTIs) for novel or orphan protein targets. Graph Neural Networks (GNNs) have emerged as powerful deep learning models for complex biological interactions, yet they often struggle with the cold-start problem, defined here as generalisation to unseen proteins or ligands. This thesis addresses that challenge by developing a GNN-based model, built with PyTorch Geometric, to predict inhibitory constants (Ki) and half-maximal inhibitory concentrations (IC50) for protein-ligand pairs.
Comprehensive datasets were extracted from the ChEMBL database, focusing on Swiss-Prot verified proteins to ensure high-quality interaction data. After removing entries that lacked canonical SMILES, lacked activity values, or contained ambiguous activity measurements, refined datasets of 276,098 Ki entries (70.83% retention) and 412,726 IC50 entries (82.29% retention) were obtained.
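The filtering step described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the thesis pipeline: the field names (`canonical_smiles`, `standard_value`, `standard_relation`) are assumed to follow ChEMBL-style column naming, the rule that any relation other than `=` counts as "ambiguous" is an assumption, and the toy records are invented.

```python
def is_usable(record):
    """Return True if a record has a SMILES string, a value, and an exact relation."""
    if not record.get("canonical_smiles"):
        return False  # no structure to build a molecular graph from
    if record.get("standard_value") is None:
        return False  # no activity label to train on
    # Assumption: censored measurements such as '>' or '<' count as ambiguous.
    return record.get("standard_relation") == "="


def filter_records(records):
    """Keep usable records and report the retention percentage."""
    kept = [r for r in records if is_usable(r)]
    retention = 100.0 * len(kept) / len(records) if records else 0.0
    return kept, retention


# Toy records, invented for illustration only.
toy = [
    {"canonical_smiles": "CCO", "standard_value": 12.0, "standard_relation": "="},
    {"canonical_smiles": None, "standard_value": 5.0, "standard_relation": "="},
    {"canonical_smiles": "c1ccccc1", "standard_value": None, "standard_relation": "="},
    {"canonical_smiles": "CCN", "standard_value": 100.0, "standard_relation": ">"},
]
kept, retention = filter_records(toy)
print(len(kept), f"{retention:.2f}%")
```

Retention here is simply the share of records that survive all three checks, which is how the percentages quoted in the abstract are presumably computed.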
Molecules and proteins were represented as graph objects, enabling the GNN to capture intricate structural and relational features. The model architecture consisted of two encoders, one for molecules and one for proteins, whose learned features were fused by concatenation and fed into a multilayer perceptron head for activity prediction.
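A dual-encoder design of this kind might look like the sketch below. To keep it dependency-light it uses plain PyTorch with a hand-rolled mean-aggregation layer over a dense adjacency matrix rather than PyTorch Geometric layers, so it is a stand-in for the idea and not the thesis model. The class names, layer counts, hidden size, and feature dimensions are all assumptions.

```python
import torch
from torch import nn


class MeanGraphEncoder(nn.Module):
    """Two rounds of mean-neighbour message passing, then mean pooling over nodes."""

    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden_dim)
        self.lin2 = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x, adj):
        # x: (num_nodes, in_dim); adj: (num_nodes, num_nodes), self-loops included.
        norm_adj = adj / adj.sum(dim=1, keepdim=True)  # row-normalise: mean over neighbours
        h = torch.relu(self.lin1(norm_adj @ x))
        h = torch.relu(self.lin2(norm_adj @ h))
        return h.mean(dim=0)  # graph-level embedding, shape (hidden_dim,)


class DualEncoderRegressor(nn.Module):
    """Separate molecule and protein encoders, concatenation, then an MLP head."""

    def __init__(self, mol_dim, prot_dim, hidden_dim=64):
        super().__init__()
        self.mol_encoder = MeanGraphEncoder(mol_dim, hidden_dim)
        self.prot_encoder = MeanGraphEncoder(prot_dim, hidden_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, mol_x, mol_adj, prot_x, prot_adj):
        mol_emb = self.mol_encoder(mol_x, mol_adj)
        prot_emb = self.prot_encoder(prot_x, prot_adj)
        fused = torch.cat([mol_emb, prot_emb], dim=-1)  # fusion by concatenation
        return self.head(fused)  # one predicted activity value, shape (1,)


# Toy graphs with random features (hypothetical sizes: 5 atoms, 7 residues).
torch.manual_seed(0)
mol_x, mol_adj = torch.randn(5, 8), torch.eye(5)
mol_adj[0, 1] = mol_adj[1, 0] = 1.0
prot_x, prot_adj = torch.randn(7, 20), torch.eye(7)
prot_adj[2, 3] = prot_adj[3, 2] = 1.0

model = DualEncoderRegressor(mol_dim=8, prot_dim=20)
prediction = model(mol_x, mol_adj, prot_x, prot_adj)
print(prediction.shape)
```

Concatenation keeps the two embeddings independent until the head, which is the simplest fusion choice; the head then has to learn any molecule-protein interaction terms on its own.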
Experiments revealed a clear discrepancy in model performance between traditional random splits and a more stringent cold-start evaluation. While the model showed strong predictive capabilities on a validation set randomly sampled from the training data, it performed worse on a 'blinded' cold-start dataset, for which entire proteins and their interactions were held out before splitting. The model detects some signal in the blind dataset, yet the decline highlights its struggle to generalise to entirely new proteins, a common scenario in drug discovery when seeking ligands for orphan targets.
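The difference between the two evaluation settings can be made concrete with a small sketch of a protein-level hold-out followed by a random split. It is an illustration under stated assumptions: the function name, the `protein_id` field, and both fractions are hypothetical, and the thesis may have selected its blind proteins differently.

```python
import random


def cold_start_split(records, holdout_fraction=0.2, val_fraction=0.2, seed=0):
    """Hold out whole proteins first, then split the remainder at random."""
    rng = random.Random(seed)
    proteins = sorted({r["protein_id"] for r in records})
    rng.shuffle(proteins)
    n_holdout = max(1, int(len(proteins) * holdout_fraction))
    blind_proteins = set(proteins[:n_holdout])

    # Every interaction of a held-out protein goes to the blind set.
    blind = [r for r in records if r["protein_id"] in blind_proteins]
    seen = [r for r in records if r["protein_id"] not in blind_proteins]

    # Random split: validation proteins also occur in the training set.
    rng.shuffle(seen)
    n_val = int(len(seen) * val_fraction)
    return seen[n_val:], seen[:n_val], blind


# Toy data: 5 proteins with 10 interactions each (invented for illustration).
toy = [{"protein_id": f"P{i % 5}", "ligand_id": f"L{i}"} for i in range(50)]
train, val, blind = cold_start_split(toy)
print(len(train), len(val), len(blind))
```

Because the blind proteins are removed before any shuffling, no interaction involving them can leak into training, whereas the random validation set shares its proteins with the training set.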
These results highlight the limitations of current GNN approaches in addressing the cold-start problem and emphasise the need for novel strategies to enhance model generalisation. Future work should explore techniques such as transfer learning, incorporation of protein domain knowledge and knowledge graphs, and data augmentation to mitigate this issue and improve performance. Overcoming the cold-start challenge is crucial for broadening the scope of targetable proteins and expediting the development of treatments for previously unstudied targets.
Description
Master's thesis, Bioinformatics and Computational Biology, 2025, Universidade de Lisboa, Faculdade de Ciências
Keywords
Deep learning; drug-target interaction (DTI); graph neural networks (GNNs); virtual ligand screening (VLS); orphan targets; Master's theses - 2025
