Leveraging LLMs for cardiovascular disease-oriented entity recognition and relation extraction in electronic health records

Mataloto, Diogo Miguel Gomes

http://hdl.handle.net/10400.5/117053

Use this identifier to cite this record.

Name:	Description:	Size:	Format:
TM_Diogo_Mataloto.pdf		2.79 MB	Adobe PDF

Authors

Mataloto, Diogo Miguel Gomes

Advisor(s)

Fernandes, Maria Isabel Mou Sequeira

Couto, Francisco José Moreira

Abstract(s)

Cardiovascular and cerebrovascular diseases (CCDs) remain among the leading causes of morbidity and mortality worldwide, posing significant challenges to healthcare systems. Electronic Health Records (EHRs) store vast amounts of structured and unstructured clinical data, with critical and detailed information often found in free-text clinical notes. Extracting knowledge from these notes is essential to improve patient care and support clinical research. However, clinical notes present substantial challenges, including considerable document length, highly specialized terminology, ambiguities, and multilingual variation. This dissertation addresses these challenges through the development of a modular pipeline, LLaMIC (LLaMA Models applied to MIMIC), designed for entity recognition and relation extraction in clinical notes. The work is organized into two phases. In the first phase, the LLaMIC pipeline is presented, integrating the large language model LLaMA for entity detection, linking these entities to standardized terminologies (International Classification of Diseases and MeSH), and extracting relations, with an emphasis on therapeutic associations. In the second phase, the creation of three supervised corpora for CCDs, therapeutic drugs (CCDt), and their respective relations is described, combining automatic annotation with non-expert manual correction, as well as the implementation of the LLaMIC pipeline on these corpora. Evaluation of LLaMIC demonstrates substantial improvements over baseline models. For entity recognition in a lenient mode, the best LLaMIC model achieved a precision of 0.887 for CCDs, surpassing BENT by 32% and closely matching BENT’s performance for CCDt (2 percentage points lower). Relation extraction outperformed BioLinkBERT by 4%. The annotated corpora and optimized models are publicly available.

Description

Master's thesis, Bioinformatics and Computational Biology, 2025, Universidade de Lisboa, Faculdade de Ciências

Keywords

Cardiovascular and Cerebrovascular diseases; Electronic Health Records; Named Entity Recognition; Relation Extraction; Large Language Models

URI

http://hdl.handle.net/10400.5/117053

Collections

Pure > Dspace
PURE > Dspace - Faculdade de Ciências