Prediction of ligand binding activity across the human druggable genome with Proximity Amino Acid Sets

Losco, Simona

Publication

2026, Master's dissertation

dc.contributor.author	Losco, Simona
dc.contributor.institution	Faculty of Sciences
dc.contributor.supervisor	Falcão, André Osório e Cruz de Azeredo
dc.date.accessioned	2026-02-16T12:20:01Z
dc.date.available	2026-02-16T12:20:01Z
dc.date.issued	2026
dc.description	Master's thesis, Bioinformatics and Computational Biology, 2026, Universidade de Lisboa, Faculdade de Ciências
dc.description.abstract	Understanding protein-ligand interactions across the human druggable genome is crucial for accelerating drug discovery. This study develops a comprehensive computational framework that integrates structural bioinformatics with machine learning to predict binding affinities, with particular focus on identifying critical local protein structural motifs that govern molecular recognition. The study introduces Proximity Amino Acid Sets (PAAS), a novel structural representation derived from AlphaFold 2.0-predicted structures. PAAS captures local spatial neighborhoods of amino acids and encodes them using the BLOSUM62 substitution matrix followed by principal component analysis (PCA), employing a Word2Vec-like approach where amino acid vectors are summed to represent local structural environments. A two-stage K-means clustering approach organizes these structural patterns into 20,736 clusters, creating a hierarchical representation of structural motifs across the human proteome. These structural descriptors are systematically combined with ECFP8 molecular fingerprints in a proteochemometric modeling (PCM) framework. We evaluate multiple machine learning architectures, including ensemble methods (XGBoost, Random Forests), transformer encoders, and contrastive learning models, employing rigorous protein-centric validation to assess generalization to novel targets. The methodology identifies high-importance PAAS clusters localized near functional protein domains, demonstrating their role as structural determinants of binding. XGBoost achieves superior performance (test R² = 0.80 for Ki prediction), significantly outperforming deep learning models in generalization capability. On completely novel (orphaned) targets, the model explains 31.5% of binding affinity variance, indicating meaningful generalization to unseen proteins.
Structural and functional analysis reveals that predictive PAAS clusters frequently coincide with known binding sites, supported by Gene Ontology (GO) enrichment analysis showing 43% of high-importance clusters have statistically significant associations with specific molecular functions. The framework establishes structure-aware representations that capture key interaction determinants, enabling reliable prediction even for proteins with limited experimental data. This approach provides practical tools for drug discovery applications, including ligand prioritization for emerging targets and systematic off-target assessment, while offering fundamental insights into principles of molecular recognition. The consistent superiority of ensemble methods over complex deep learning architectures in this application challenges conventional assumptions about model complexity requirements for biomolecular interaction prediction.	en
dc.format	application/pdf
dc.identifier.uri	http://hdl.handle.net/10400.5/117102
dc.language.iso	eng
dc.subject	Proximity Amino Acid Sets (PAAS)
dc.subject	Protein-Ligand Interaction Prediction
dc.subject	Human Druggable Genome
dc.subject	Amino Acid Embeddings
dc.subject	Machine Learning
dc.title	Prediction of ligand binding activity across the human druggable genome with Proximity Amino Acid Sets	en
dc.type	master thesis
dspace.entity.type	Publication
rcaap.rights	openAccess
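
The protein-centric validation mentioned in the abstract can be illustrated with a minimal sketch. It assumes only that such a split keeps every record of a given protein on one side of the train/test boundary, so test performance reflects unseen targets. The function name `protein_centric_split`, the record layout `(protein_id, ligand_id, affinity)`, the test fraction, and the toy values are all hypothetical and are not taken from the thesis.

```python
import random
from collections import defaultdict


def protein_centric_split(records, test_fraction=0.2, seed=0):
    """Split records so that no protein appears in both train and test.

    records: iterable of (protein_id, ligand_id, affinity) tuples
             (hypothetical layout, chosen for illustration only).
    """
    # Group all measurements by their protein identifier.
    by_protein = defaultdict(list)
    for rec in records:
        by_protein[rec[0]].append(rec)

    # Shuffle the proteins (not the individual records) reproducibly.
    proteins = sorted(by_protein)
    random.Random(seed).shuffle(proteins)

    # Hold out whole proteins; always keep at least one in the test set.
    n_test = max(1, round(len(proteins) * test_fraction))
    test_proteins = proteins[:n_test]
    train_proteins = proteins[n_test:]

    train = [r for p in train_proteins for r in by_protein[p]]
    test = [r for p in test_proteins for r in by_protein[p]]
    return train, test


# Toy data: four proteins, seven protein-ligand measurements.
records = [
    ("P1", "ligA", 6.2), ("P1", "ligB", 7.1),
    ("P2", "ligA", 5.4),
    ("P3", "ligC", 8.0), ("P3", "ligD", 6.6),
    ("P4", "ligB", 7.7), ("P4", "ligE", 5.9),
]
train, test = protein_centric_split(records, test_fraction=0.25)
train_ids = {r[0] for r in train}
test_ids = {r[0] for r in test}
print("train proteins:", sorted(train_ids))
print("test proteins:", sorted(test_ids))
```

Because the shuffle is applied to protein identifiers rather than to individual measurements, a ligand may still occur on both sides of the split, while a protein never does. That is the property a target-generalization estimate relies on.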

Files

Main

Name: TM_Simona_Losco.pdf
Size: 2.26 MB
Format: Adobe Portable Document Format

Collections

Pure > Dspace
PURE > Dspace - Faculdade de Ciências