Publication

Prediction of ligand binding activity across the human druggable genome with Proximity Amino Acid Sets

dc.contributor.author: Losco, Simona
dc.contributor.institution: Faculty of Sciences
dc.contributor.supervisor: Falcão, André Osório e Cruz de Azeredo
dc.date.accessioned: 2026-02-16T12:20:01Z
dc.date.available: 2026-02-16T12:20:01Z
dc.date.issued: 2026
dc.description: Master's thesis, Bioinformatics and Computational Biology, 2026, Universidade de Lisboa, Faculdade de Ciências
dc.description.abstract [en]: Understanding protein-ligand interactions across the human druggable genome is crucial for accelerating drug discovery. This study develops a computational framework that integrates structural bioinformatics with machine learning to predict binding affinities, with particular focus on identifying the local protein structural motifs that govern molecular recognition. The study introduces Proximity Amino Acid Sets (PAAS), a novel structural representation derived from AlphaFold 2.0-predicted structures. PAAS captures local spatial neighborhoods of amino acids and encodes them using the BLOSUM62 substitution matrix followed by principal component analysis (PCA), employing a Word2Vec-like approach in which amino acid vectors are summed to represent local structural environments. A two-stage K-means clustering approach organizes these structural patterns into 20,736 clusters, creating a hierarchical representation of structural motifs across the human proteome. These structural descriptors are combined with ECFP8 molecular fingerprints in a proteochemometric modeling (PCM) framework. We evaluate multiple machine learning architectures, including ensemble methods (XGBoost, Random Forests), transformer encoders, and contrastive learning models, using protein-centric validation to assess generalization to novel targets. The methodology identifies high-importance PAAS clusters localized near functional protein domains, demonstrating their role as structural determinants of binding. XGBoost achieves the best performance (test R² = 0.80 for Ki prediction), significantly outperforming the deep learning models in generalization capability. Validation on completely novel (orphaned) targets explains 31.5% of binding affinity variance, indicating meaningful generalization to unseen proteins.
Structural and functional analysis reveals that predictive PAAS clusters frequently coincide with known binding sites. This is supported by Gene Ontology (GO) enrichment analysis, which shows that 43% of high-importance clusters have statistically significant associations with specific molecular functions. The framework establishes structure-aware representations that capture key interaction determinants, enabling reliable prediction even for proteins with limited experimental data. This approach provides practical tools for drug discovery applications, including ligand prioritization for emerging targets and systematic off-target assessment, while offering fundamental insights into the principles of molecular recognition. The consistent superiority of ensemble methods over complex deep learning architectures in this application challenges conventional assumptions about the model complexity required for biomolecular interaction prediction.
dc.format: application/pdf
dc.identifier.uri: http://hdl.handle.net/10400.5/117102
dc.language.iso: eng
dc.subject: Proximity Amino Acid Sets (PAAS)
dc.subject: Protein-Ligand Interaction Prediction
dc.subject: Human Druggable Genome
dc.subject: Amino Acid Embeddings
dc.subject: Machine Learning
dc.title [en]: Prediction of ligand binding activity across the human druggable genome with Proximity Amino Acid Sets
dc.type: master thesis
dspace.entity.type: Publication
rcaap.rights: openAccess
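
The abstract describes PAAS as local spatial neighborhoods of amino acids whose vectors are summed, in a Word2Vec-like fashion, to represent a local structural environment. The sketch below illustrates only that summation idea and is not the thesis implementation. The 2-D residue vectors are made-up placeholders standing in for BLOSUM62 rows reduced by PCA, and the distance cutoff, the coordinates, and the function names `neighbourhood` and `summed_vector` are all assumptions introduced for illustration.

```python
import math

# Hypothetical 2-D residue vectors. The abstract derives the real ones from
# BLOSUM62 rows reduced with PCA; these numbers are placeholders only.
TOY_VECTORS = {
    "A": (1.0, 0.0),
    "G": (0.5, 0.5),
    "K": (-1.0, 2.0),
    "D": (-2.0, -1.0),
}


def neighbourhood(residues, centre_index, cutoff):
    """Return the residue letters within `cutoff` of the centre residue.

    `residues` is a list of (letter, (x, y, z)) pairs; the centre residue
    itself is included. The cutoff value is an assumption of this sketch.
    """
    _, centre_xyz = residues[centre_index]
    return [aa for aa, xyz in residues if math.dist(xyz, centre_xyz) <= cutoff]


def summed_vector(letters, vectors=TOY_VECTORS):
    """Sum the per-residue vectors of a neighbourhood, component by component."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for aa in letters:
        for i, value in enumerate(vectors[aa]):
            total[i] += value
    return tuple(total)


# Toy coordinates (arbitrary units) for four residues of an imaginary structure.
residues = [
    ("A", (0.0, 0.0, 0.0)),
    ("G", (3.0, 0.0, 0.0)),
    ("K", (0.0, 4.0, 0.0)),
    ("D", (20.0, 0.0, 0.0)),
]

near = neighbourhood(residues, centre_index=0, cutoff=5.0)
paas_vector = summed_vector(near)
print(near, paas_vector)
```

Because the vectors are summed, the result does not depend on the order of the residues in the neighborhood, which is what makes it a set-like descriptor. In the workflow the abstract outlines, vectors of this kind would then be grouped by the two-stage K-means step.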

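The abstract reports protein-centric validation and results in terms of R² and explained variance. A minimal sketch of what such a split can look like follows, assuming it means that every protein's measurements fall entirely in either the training set or the test set. The record layout, the protein and ligand identifiers, the affinity values, and the helper names `protein_centric_split` and `r_squared` are hypothetical and not taken from the thesis.

```python
import random


def protein_centric_split(records, test_fraction=0.25, seed=0):
    """Split records so that no protein appears in both train and test."""
    proteins = sorted({r["protein"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    n_test = max(1, round(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])
    train = [r for r in records if r["protein"] not in test_proteins]
    test = [r for r in records if r["protein"] in test_proteins]
    return train, test


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up protein-ligand records with placeholder affinity values.
records = [
    {"protein": "P1", "ligand": "L1", "affinity": 6.1},
    {"protein": "P1", "ligand": "L2", "affinity": 7.3},
    {"protein": "P2", "ligand": "L1", "affinity": 5.0},
    {"protein": "P3", "ligand": "L3", "affinity": 8.2},
    {"protein": "P4", "ligand": "L2", "affinity": 6.6},
    {"protein": "P4", "ligand": "L4", "affinity": 4.9},
]

train, test = protein_centric_split(records)
train_proteins = {r["protein"] for r in train}
test_proteins = {r["protein"] for r in test}
print(sorted(train_proteins), sorted(test_proteins))
```

Under this reading, a model is always scored on proteins it never saw during training. R² is the fraction of variance explained, so the 31.5% figure for orphaned targets would correspond to an R² of about 0.315.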
Files

Main
Name: TM_Simona_Losco.pdf
Size: 2.26 MB
Format: Adobe Portable Document Format