Publication

Leveraging LLMs for cardiovascular disease-oriented entity recognition and relation extraction in electronic health records

dc.contributor.author: Mataloto, Diogo Miguel Gomes
dc.contributor.institution: Faculty of Sciences
dc.contributor.supervisor: Fernandes, Maria Isabel Mou Sequeira
dc.contributor.supervisor: Couto, Francisco José Moreira
dc.date.accessioned: 2026-02-12T16:50:04Z
dc.date.available: 2026-02-12T16:50:04Z
dc.date.issued: 2025
dc.description: Master's thesis, Bioinformatics and Computational Biology, 2025, Universidade de Lisboa, Faculdade de Ciências
dc.description.abstract: Cardiovascular and cerebrovascular diseases (CCDs) remain among the leading causes of morbidity and mortality worldwide, posing significant challenges to healthcare systems. Electronic Health Records (EHRs) store vast amounts of structured and unstructured clinical data, with critical and detailed information often found in free-text clinical notes. Extracting knowledge from these notes is essential to improve patient care and support clinical research. However, clinical notes present substantial challenges, including considerable document length, highly specialized terminology, ambiguities, and multilingual variation. This dissertation addresses these challenges through the development of a modular pipeline, LLaMIC (LLaMA Models applied to MIMIC), designed for entity recognition and relation extraction in clinical notes. The work is organized into two phases. The first phase presents the LLaMIC pipeline, which uses the large language model LLaMA to detect entities, links these entities to standardized terminologies (International Classification of Diseases and MeSH), and extracts relations, with an emphasis on therapeutic associations. The second phase describes the creation of three supervised corpora, covering CCDs, therapeutic drugs (CCDt), and the relations between them, built by combining automatic annotation with non-expert manual correction, and the application of the LLaMIC pipeline to these corpora. Evaluation of LLaMIC demonstrates substantial improvements over baseline models. For entity recognition under lenient matching, the best LLaMIC model achieved a precision of 0.887 for CCDs, surpassing BENT by 32%, and came within 2 percentage points of BENT's performance for CCDt. Relation extraction outperformed BioLinkBERT by 4%. The annotated corpora and optimized models are publicly available.
dc.format: application/pdf
dc.identifier.tid: 204173302
dc.identifier.uri: http://hdl.handle.net/10400.5/117053
dc.language.iso: eng
dc.subject: Cardiovascular and Cerebrovascular diseases
dc.subject: Electronic Health Records
dc.subject: Named Entity Recognition
dc.subject: Relation Extraction
dc.subject: Large Language Models
dc.title: Leveraging LLMs for cardiovascular disease-oriented entity recognition and relation extraction in electronic health records
dc.type: master thesis
dspace.entity.type: Publication
rcaap.rights: openAccess

Files

Main
Name: TM_Diogo_Mataloto.pdf
Size: 2.79 MB
Format: Adobe Portable Document Format