
Exploring Causal Attention Models in Transformers for Large Language Models

File: TM_João_Terroa.pdf (3.41 MB, Adobe PDF)

Abstract

Transformers have played a central role in Natural Language Processing (NLP) since 2017, enabling major advances in applications such as machine translation and text generation. Despite their success, they face challenges such as high computational cost and the environmental impact of their energy consumption. Recent research has focused on developing more compact and efficient models that can run on resource-limited devices without relying on cloud infrastructure. Transformer optimization seeks to balance performance against computational resources. The fundamental structure of transformers has remained largely unchanged, and our understanding of it rests on empirical observation, leaving theoretical gaps. Studies suggest that components considered essential can be removed without compromising performance, indicating that transformer components are worth re-evaluating. This study analyzes the self-attention mechanism theoretically and empirically, examining existing optimizations and evaluating alternatives for improvement. Five modifications were designed: Simple Self-Attention (SSA), Layered Self-Attention (LSA), Variable Self-Attention (VSA), Simple Layered Self-Attention (SLSA), and Variable Layered Self-Attention (VLSA). Implementation involved an exploratory and a confirmatory phase. The exploratory phase altered the self-attention mechanism and extensively tested promising alternatives using the nanoGPT repository. The confirmatory phase fine-tuned the best-performing mechanisms, evaluating the modified versions alongside the original self-attention as a baseline. Results indicated that Variable Layered Self-Attention (VLSA) models, especially with higher k values, outperformed the standard self-attention mechanism, achieving lower validation losses and improved generalization, even with fewer training iterations.
These findings suggest that alternative attention mechanisms can enhance transformer-based language models without extensive architectural changes, offering a practical path to improved efficiency and accuracy. In conclusion, this study demonstrates that the current self-attention implementation may not be optimal and that exploring alternative mechanisms can lead to significant improvements. Future work proposes applying these mechanisms to larger models and datasets and expanding evaluation to broader benchmarks, aiming to generalize the improvements and better understand the benefits of the proposed attention mechanisms.
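For reference, the baseline the thesis compares against, standard scaled dot-product causal self-attention as used in nanoGPT-style models, can be sketched in a few lines of NumPy. This is an illustrative single-head, batch-free simplification, not the thesis's own code:

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention with a causal mask.

    x: (T, d) token embeddings; Wq, Wk, Wv: (d, d) projection matrices.
    A minimal sketch of the baseline mechanism; real implementations
    (e.g. nanoGPT) add multiple heads, batching, dropout, and an
    output projection.
    """
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(d)
    # Causal mask: position t may attend only to positions <= t.
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    # Row-wise softmax over the allowed (unmasked) positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

The SSA, LSA, VSA, SLSA, and VLSA variants studied in the thesis modify this mechanism; their exact formulations are given in the full text.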

Description

Master's thesis, Informatics, 2024, Universidade de Lisboa, Faculdade de Ciências

Keywords

transformers; self-attention mechanisms; natural language processing; language models; performance optimization; Master's theses - 2024
