Leveraging Large Language Models for Document Classification

Nogueira, Rómulo Brandão

http://hdl.handle.net/10400.5/116936

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
TM_Romulo_Nogueira.pdf		1.91 MB	Adobe PDF

Authors

Nogueira, Rómulo Brandão

Abstract

Document classification serves as a foundational step in critical tasks such as information extraction, analysis, and decision-making. However, existing approaches often struggle with the variability, volume, and complexity of real-world documents. These methods are further limited by a lack of configurability and explainability: they require specialized technical expertise to accommodate diverse user needs and often produce results that are difficult to interpret. To address the complexities of modern document processing, this dissertation introduces a novel zero-shot document classification framework that leverages Large Language Models (LLMs) and is designed for accessibility and configurability by both technical and non-technical users. Unlike traditional methods, which require extensive labeled data, the zero-shot configuration enables the framework to perform the classification task without any prior exposure to labeled examples of the target categories, relying instead on semantic understanding derived from user-provided label descriptions and document content. To validate the proposed framework, a dataset tailored to the banking sector was constructed, bringing together documents of different types and sizes. Based on this corpus, three distinct use cases were defined to assess the practical usefulness of the framework in different scenarios. The subsets were further explored through evaluations of different retrieval strategies and through comparisons with competing zero-shot approaches, whether LLM-based or not, providing a broader perspective on the framework’s effectiveness. Experimental results show that the framework achieves higher accuracy while requiring fewer tokens, which directly translates into lower operating costs compared to the baseline, pointing toward a gradual refinement in current document classification practices.

Description

Master's project work, Informatics Engineering, 2025, Universidade de Lisboa, Faculdade de Ciências

Keywords

Document Classification; Large Language Models; Banking Sector; Zero-Shot Classifier; Retrieval-Augmented Generation

URI

http://hdl.handle.net/10400.5/116936

Collections

Pure > Dspace
PURE > Dspace - Faculdade de Ciências
