| Name | Description | Size | Format |
|---|---|---|---|
| | | 1.79 MB | Adobe PDF |
Authors
Abstract(s)
This work explores the potential of Protein Language Models (PLMs) to advance the design
of novel Adeno-Associated Virus 2 (AAV2) sequences, focusing on two primary objectives:
sequence classification and generative design. For the classification task, we fine-tuned a PLM
(ProtBERT) to accurately differentiate between viable and non-viable AAV2 sequences. Results
demonstrated high classification performance across multiple trained models, validating the hypothesis that domain-specific fine-tuning enables PLMs to effectively capture important AAV2
sequence features. For sequence generation, we fine-tuned a conditional generative PLM (ProGen) to design viable AAV2 capsid protein sequences. While the model generated structurally
diverse sequences, extensive evaluations indicated that additional refinements are necessary to
consistently align with viability criteria. The classification model highlights the potential of PLMs
in predicting sequence viability, offering a reliable approach that could help reduce experimental
costs. The generative approach, though it requires further optimization, opens
a new avenue for designing diverse AAV2 variants. Future work will focus on refining the generative framework by incorporating explicit viability tags, classifier feedback, and more extensive
testing of generation hyperparameters, as well as extending the framework to additional AAV2 properties. This work lays a foundation for leveraging PLMs in AAV2 sequence engineering, with
promising prospects for the use of language models in viral vector design.
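
As an illustration of the classification input step, the following is a minimal sketch of how an amino-acid sequence is typically formatted before being tokenized for ProtBERT. It assumes the convention given on the public ProtBERT model card (residues separated by spaces, with the rare codes U, Z, O and B mapped to X). The abstract does not describe the preprocessing used in the thesis, so the function name `format_for_protbert` and the example fragment are hypothetical.

```python
import re

# Residue codes that the public ProtBERT model card maps to the unknown residue "X".
RARE_RESIDUES = re.compile(r"[UZOB]")


def format_for_protbert(sequence: str) -> str:
    """Return an amino-acid string in space-separated form (assumed ProtBERT convention)."""
    # Drop any whitespace and normalise to upper case.
    cleaned = "".join(sequence.split()).upper()
    # Map rare residue codes to "X".
    cleaned = RARE_RESIDUES.sub("X", cleaned)
    # Put one space between consecutive residues.
    return " ".join(cleaned)


if __name__ == "__main__":
    # Hypothetical fragment for illustration only; not taken from the thesis data.
    print(format_for_protbert("dEEiRtUnp"))
```

The formatted string would then be passed to the model's tokenizer; fine-tuning and evaluation are outside the scope of this sketch.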
Description
Master's thesis, Informatics Engineering, 2025, Universidade de Lisboa, Faculdade de Ciências
Keywords
Deep learning; Protein language models; Transfer learning; Representations; Protein sequence research; Master's theses - 2025
