Publication
Prediction of ligand binding activity across the human druggable genome with Proximity Amino Acid Sets
| dc.contributor.author | Losco, Simona | |
| dc.contributor.institution | Faculty of Sciences | |
| dc.contributor.supervisor | Falcão, André Osório e Cruz de Azeredo | |
| dc.date.accessioned | 2026-02-16T12:20:01Z | |
| dc.date.available | 2026-02-16T12:20:01Z | |
| dc.date.issued | 2026 | |
| dc.description | Master's thesis, Bioinformatics and Computational Biology, 2026, Universidade de Lisboa, Faculdade de Ciências | |
| dc.description.abstract | Understanding protein-ligand interactions across the human druggable genome is crucial for accelerating drug discovery. This study develops a comprehensive computational framework that integrates structural bioinformatics with machine learning to predict binding affinities, with particular focus on identifying critical local protein structural motifs that govern molecular recognition. The study introduces Proximity Amino Acid Sets (PAAS), a novel structural representation derived from AlphaFold 2.0-predicted structures. PAAS captures local spatial neighborhoods of amino acids and encodes them using the BLOSUM62 substitution matrix followed by principal component analysis (PCA), employing a Word2Vec-like approach where amino acid vectors are summed to represent local structural environments. A two-stage K-means clustering approach organizes these structural patterns into 20,736 clusters, creating a hierarchical representation of structural motifs across the human proteome. These structural descriptors are systematically combined with ECFP8 molecular fingerprints in a proteochemometric modeling (PCM) framework. We evaluate multiple machine learning architectures, including ensemble methods (XGBoost, Random Forests), transformer encoders, and contrastive learning models, employing rigorous protein-centric validation to assess generalization to novel targets. The methodology identifies high-importance PAAS clusters localized near functional protein domains, demonstrating their role as structural determinants of binding. XGBoost achieves superior performance (test R² = 0.80 for Ki prediction), significantly outperforming deep learning models in generalization capability. On completely novel (orphaned) targets, the model explains 31.5% of binding affinity variance, indicating meaningful generalization to unseen proteins.
Structural and functional analysis reveals that predictive PAAS clusters frequently coincide with known binding sites, supported by Gene Ontology (GO) enrichment analysis showing 43% of high-importance clusters have statistically significant associations with specific molecular functions. The framework establishes structure-aware representations that capture key interaction determinants, enabling reliable prediction even for proteins with limited experimental data. This approach provides practical tools for drug discovery applications, including ligand prioritization for emerging targets and systematic off-target assessment, while offering fundamental insights into principles of molecular recognition. The consistent superiority of ensemble methods over complex deep learning architectures in this application challenges conventional assumptions about model complexity requirements for biomolecular interaction prediction. | en |
| dc.format | application/pdf | |
| dc.identifier.uri | http://hdl.handle.net/10400.5/117102 | |
| dc.language.iso | eng | |
| dc.subject | Proximity Amino Acid Sets (PAAS) | |
| dc.subject | Protein-Ligand Interaction Prediction | |
| dc.subject | Human Druggable Genome | |
| dc.subject | Amino Acid Embeddings | |
| dc.subject | Machine Learning | |
| dc.title | Prediction of ligand binding activity across the human druggable genome with Proximity Amino Acid Sets | en |
| dc.type | master thesis | |
| dspace.entity.type | Publication | |
| rcaap.rights | openAccess |
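
The abstract describes PAAS as summing per-residue vectors over a local spatial neighborhood. The following is only a minimal sketch of that idea, not the thesis implementation. It assumes a distance cutoff of 8.0 Å on one coordinate per residue (the abstract does not state how neighborhoods are defined), and it replaces the BLOSUM62-plus-PCA vectors with random placeholder vectors. All function names here are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_placeholder_embeddings(dim=4, seed=0):
    """Random stand-in vectors, one per amino acid.

    Assumption: the thesis derives these from BLOSUM62 rows reduced by PCA;
    random values are used here purely for illustration.
    """
    rng = random.Random(seed)
    return {aa: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for aa in AMINO_ACIDS}


def proximity_sets(coords, radius):
    """For each residue, list the indices of residues within `radius` (itself included)."""
    return [
        [j for j, q in enumerate(coords) if math.dist(p, q) <= radius]
        for p in coords
    ]


def encode_neighborhoods(sequence, coords, embeddings, radius=8.0):
    """Sum the embedding vectors of all residues in each proximity set."""
    dim = len(next(iter(embeddings.values())))
    encoded = []
    for members in proximity_sets(coords, radius):
        total = [0.0] * dim
        for j in members:
            vec = embeddings[sequence[j]]
            total = [t + v for t, v in zip(total, vec)]
        encoded.append(total)
    return encoded


# Toy example: three residues spaced 3.8 Å apart and one far away.
sequence = "ACDK"
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
emb = make_placeholder_embeddings()
sets = proximity_sets(coords, radius=8.0)
vectors = encode_neighborhoods(sequence, coords, emb, radius=8.0)
```

In this toy input the first three residues share one proximity set, so they receive the same summed vector, while the isolated fourth residue is encoded by its own vector alone. Vectors of this kind are what a clustering step such as the two-stage K-means mentioned in the abstract would then group.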
Files
Main
- Name: TM_Simona_Losco.pdf
- Size: 2.26 MB
- Format: Adobe Portable Document Format
