Leveraging deep learning models for studying RNA splicing in health and disease

Barbosa, Pedro

http://hdl.handle.net/10400.5/99505

Use this identifier to cite or link to this record.

Name:	Description:	Size:	Format:
scnd990026354742383_td_Pedro_Barbosa.pdf		28.22 MB	Adobe PDF

Authors

Barbosa, Pedro

Advisor(s)

Fonseca, Alcides

Fonseca, Maria Carmo

Abstract

Deep learning models have demonstrated remarkable potential across various domains, including biology. Despite this, scientists and clinicians face significant challenges when using these tools in practice. One major concern is that biological systems are inherently complex, raising doubts about whether even the most advanced models can fully capture their intricacies. Additionally, the “black box” nature of these million-parameter models poses a barrier to researchers seeking not only accurate predictions but also mechanistic understanding. This thesis investigates the practical applications of deep learning models within research contexts. It focuses specifically on models trained on genomic sequences to predict RNA splicing, the domain of application for this work. The research addresses two key challenges: variant effect prediction, which serves as a practical application to assess model performance, and model interpretability, which aims to advance scientific understanding of the underlying mechanisms of RNA splicing. We first addressed the problem of predicting the effects of variants that affect splicing in deep intronic regions, which are often ignored in genetic tests but are now recognized as important. Through a comprehensive evaluation of state-of-the-art computational models on curated datasets of disease-causing deep intronic variants, we revealed the strengths and limitations of these methods. In particular, we found that purely sequence-based deep learning models, such as SpliceAI and Pangolin, were effective in predicting variant effects, and that models combining SpliceAI predictions with additional features did not improve performance. Nevertheless, we showed that, regardless of the model, there is still room for improvement, especially for variants disrupting splicing regulatory elements, which were often misclassified. For model interpretability, we conducted an in-depth analysis of SpliceAI to uncover its learned representations of RNA splicing mechanisms.
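The variant effect prediction task described above can be illustrated with a minimal sketch: score the reference and the variant sequence with a model, then summarise the variant by the largest per-position change. This is not the SpliceAI or Pangolin API. `toy_oracle`, `apply_snv` and `max_delta` are hypothetical names introduced here for illustration, and the toy oracle simply flags `GT` dinucleotides as donor-like positions.

```python
from typing import Callable, List

# Hypothetical oracle type: maps a nucleotide sequence to one score per position.
SpliceOracle = Callable[[str], List[float]]


def apply_snv(seq: str, pos: int, ref: str, alt: str) -> str:
    """Return seq with the base at 0-based pos changed from ref to alt."""
    if seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {seq[pos]!r}")
    return seq[:pos] + alt + seq[pos + 1:]


def max_delta(oracle: SpliceOracle, ref_seq: str, alt_seq: str) -> float:
    """Largest absolute per-position score change between reference and variant.

    Assumes both sequences have the same length (true for single-nucleotide
    variants), so positions line up one to one.
    """
    ref_scores = oracle(ref_seq)
    alt_scores = oracle(alt_seq)
    return max(abs(a - r) for r, a in zip(ref_scores, alt_scores))


def toy_oracle(seq: str) -> List[float]:
    """Toy stand-in for a trained model (assumption, not a real predictor):
    score 1.0 at every position that starts a 'GT' dinucleotide, else 0.0."""
    return [1.0 if seq[i:i + 2] == "GT" else 0.0 for i in range(len(seq))]


# A single-nucleotide change that creates a new donor-like 'GT' dinucleotide.
reference = "ACCGATAAGC"
variant = apply_snv(reference, pos=4, ref="A", alt="T")
delta = max_delta(toy_oracle, reference, variant)
print(variant, delta)  # ACCGTTAAGC 1.0
```

With a real model, the oracle would be replaced by the model's per-position splice site predictions, and a threshold on the resulting score would separate predicted splice-altering variants from neutral ones.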
We investigated whether SpliceAI can be used to study alternative splicing via large-scale ablation experiments. Our findings showed that SpliceAI distinguishes between constitutive and alternatively spliced exons and uses RNA-binding protein motifs as features for its predictions. However, we also highlighted some limitations and cautions for its use in such analyses. We further explored model interpretability by developing strategies to study deep learning models and splicing locally, at the individual exon level. We combined genetic programming algorithms with domain-aware grammars to produce semantically rich synthetic datasets, demonstrating their suitability for local explainable AI in genomics. Additionally, we highlighted the expressive power of grammars in enabling the design of in silico experiments, using the deep learning model as an oracle. These concepts were integrated into a software package designed for splicing-related experiments, under the assumption that the deep learning model accurately reflects the biological processes involved. In conclusion, this thesis demonstrated both the potential and the current limitations of using deep learning models as research tools for studying RNA splicing. This work also contributed open-source software and placed a strong focus on reproducibility, ensuring that this research can be adopted and extended by the broader scientific community.
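The idea of grammar-constrained in silico experiments with the model as an oracle can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis software: the toy grammar (`perturbation ::= snv | deletion | insertion`, restricted to unprotected positions), the names `Perturbation`, `sample_perturbation`, `apply_perturbation`, `build_dataset` and `toy_inclusion_score` are all hypothetical, and plain random sampling stands in for the genetic programming search described above.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

NUCLEOTIDES = "ACGT"


@dataclass(frozen=True)
class Perturbation:
    kind: str  # "snv", "deletion" or "insertion"
    pos: int   # 0-based position in the original sequence
    base: str  # new base for snv/insertion, "" for deletion


def sample_perturbation(rng: random.Random, seq: str, protected: Set[int]) -> Perturbation:
    """Sample one production of the toy grammar, never touching protected
    positions (for example, splice site dinucleotides)."""
    allowed = [i for i in range(len(seq)) if i not in protected]
    pos = rng.choice(allowed)
    kind = rng.choice(["snv", "deletion", "insertion"])
    if kind == "deletion":
        return Perturbation(kind, pos, "")
    if kind == "snv":
        # Exclude the current base so an SNV always changes the sequence.
        return Perturbation(kind, pos, rng.choice([b for b in NUCLEOTIDES if b != seq[pos]]))
    return Perturbation(kind, pos, rng.choice(NUCLEOTIDES))


def apply_perturbation(seq: str, p: Perturbation) -> str:
    """Apply a single-base perturbation to seq."""
    if p.kind == "snv":
        return seq[:p.pos] + p.base + seq[p.pos + 1:]
    if p.kind == "deletion":
        return seq[:p.pos] + seq[p.pos + 1:]
    return seq[:p.pos] + p.base + seq[p.pos:]  # insertion before pos


def build_dataset(oracle: Callable[[str], float], seq: str, protected: Set[int],
                  n: int, seed: int = 0) -> List[Tuple[str, float]]:
    """Generate n perturbed sequences and label each with the change in the
    oracle's score relative to the original sequence."""
    rng = random.Random(seed)
    baseline = oracle(seq)
    dataset = []
    for _ in range(n):
        perturbed = apply_perturbation(seq, sample_perturbation(rng, seq, protected))
        dataset.append((perturbed, oracle(perturbed) - baseline))
    return dataset


def toy_inclusion_score(seq: str) -> float:
    """Toy oracle (assumption): fraction of G bases, standing in for a model score."""
    return seq.count("G") / len(seq) if seq else 0.0


dataset = build_dataset(toy_inclusion_score, "ACGTACGTAC", protected={0, 1}, n=5, seed=0)
for perturbed, delta in dataset:
    print(perturbed, round(delta, 3))
```

Encoding the allowed edits as a grammar keeps every generated sequence interpretable: each example in the synthetic dataset corresponds to a named, biologically meaningful operation, which is what makes the resulting dataset usable for local explanations of a single exon.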

Keywords

Deep Learning; RNA Splicing; Variant Effect Prediction; Genetic Variant Prediction; Explainable AI; Genetic Programming