Advisor: Falcão, André Osório e Cruz de Azerêdo, 1969-
Author: Terroa, João Filipe Gonçalves Vieira
Date available: 2025-01-07
Date issued: 2024
Handle: http://hdl.handle.net/10400.5/96896
Description: Master's thesis, Informatics, 2024, Universidade de Lisboa, Faculdade de Ciências

Abstract:
Transformers have played a central role in Natural Language Processing (NLP) since 2017, enabling significant advances in applications such as machine translation and text generation. Despite their success, they face challenges such as high computational cost and the environmental impact of their energy consumption. Recent research has therefore focused on developing more compact and efficient models that can run on resource-limited devices without relying on cloud infrastructure; optimizing transformers means balancing performance against computational resources. Meanwhile, the fundamental structure of the transformer has remained largely unchanged, and our understanding of it rests mainly on empirical observation, leaving theoretical gaps. Studies suggest that components considered essential can be removed without compromising performance, indicating that transformer components are worth reevaluating. This study analyzes the self-attention mechanism theoretically and empirically, examining existing optimizations and evaluating alternatives for improvement. Five modifications were designed: Simple Self-Attention (SSA), Layered Self-Attention (LSA), Variable Self-Attention (VSA), Simple Layered Self-Attention (SLSA), and Variable Layered Self-Attention (VLSA). Implementation involved an exploratory phase and a confirmatory phase. The exploratory phase altered the self-attention mechanism and extensively tested promising alternatives using the nanoGPT repository. The confirmatory phase fine-tuned the best-performing mechanisms, evaluating the modified versions alongside the original self-attention as a baseline.
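For context, the baseline that these variants modify is standard scaled dot-product self-attention with a causal mask, the mechanism nanoGPT implements. Below is a minimal single-head NumPy sketch of that baseline only; the abstract does not describe the internals of SSA/LSA/VSA/SLSA/VLSA, so they are not reproduced here, and all names and toy dimensions are illustrative:

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention with a causal mask.

    x: (T, d_model) token embeddings; Wq, Wk, Wv: (d_model, d_head)
    projection matrices. Illustrative sketch, not the thesis's own code.
    """
    T = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # project to queries/keys/values
    scores = (q @ k.T) / np.sqrt(k.shape[-1])   # scaled dot products, shape (T, T)
    future = np.triu(np.ones((T, T), dtype=bool), 1)
    scores = np.where(future, -np.inf, scores)  # causal mask: hide later positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # row-wise softmax
    return w @ v                                # weighted sum of values, (T, d_head)

# Tiny smoke run with random weights.
rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
```

The causal mask is what makes this an autoregressive ("causal attention") model: position t's output depends only on positions 0..t, so the model can be trained to predict the next token.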
Results indicated that the Variable Layered Self-Attention (VLSA) models, especially with higher k values, outperformed the standard self-attention mechanism, achieving lower validation losses and better generalization even with fewer training iterations. These findings suggest that alternative attention mechanisms can improve transformer-based language models without extensive architectural changes, offering a practical route to better efficiency and accuracy. In conclusion, this study demonstrates that the current self-attention implementation may not be optimal, and that exploring alternative mechanisms can yield significant improvements. Future work proposes applying these mechanisms to larger models and datasets and expanding the evaluation metrics to broader benchmarks, in order to generalize the improvements and better understand the benefits of the proposed attention mechanisms.

Language: eng
Keywords: transformers; self-attention mechanisms; natural language processing; language models; performance optimization; Master's theses - 2024
Title: Exploring Causal Attention Models in Transformers for Large Language Models
Type: master thesis
ID: 203878442