Deep Learning to optimize viral vector production for human gene therapy

Ferraz, João Lucas Figueiredo

http://hdl.handle.net/10400.5/100019

Use this identifier to reference this record.

Name: TM_João_Ferraz.pdf
Size: 1.79 MB
Format: Adobe PDF

Authors

Ferraz, João Lucas Figueiredo

Advisor(s)

Pesquita, Cátia Luísa Santana Calisto

Rodrigues, Ana Filipa

Abstract(s)

This work explores the potential of Protein Language Models (PLMs) to advance the design of novel Adeno-Associated Virus 2 (AAV2) sequences, with two primary objectives: sequence classification and generative design. For the classification task, we fine-tuned a PLM (ProtBERT) to differentiate between viable and non-viable AAV2 sequences. The results showed high classification performance across multiple trained models, supporting the hypothesis that domain-specific fine-tuning enables PLMs to capture important AAV2 sequence features. For sequence generation, we fine-tuned a conditional generative PLM (ProGen) to design viable AAV2 capsid protein sequences. Although the model generated structurally diverse sequences, extensive evaluations indicated that further refinement is needed before its outputs consistently meet the viability criteria. The classification model highlights the potential of PLMs for predicting sequence viability, offering a reliable approach that could help reduce experimental costs. The generative approach, though it requires further optimization, opens a new avenue for designing diverse AAV2 variants. Future efforts will focus on refining the generative framework by incorporating explicit viability tags, classifier feedback, and more extensive testing of generation hyperparameters, as well as extending it to additional AAV2 properties. This work lays a foundation for applying PLMs to AAV2 sequence engineering and, more broadly, for using language models in viral vector design.

Description

Master's thesis, Informatics Engineering, 2025, Universidade de Lisboa, Faculdade de Ciências

Keywords

Deep learning; Protein language models; Transfer learning; Representations; Protein sequence research; Master's theses - 2025