| Name | Description | Size | Format |
|---|---|---|---|
| | | 2.79 MB | Adobe PDF |
Authors
Advisor(s)
Abstract(s)
Cardiovascular and cerebrovascular diseases (CCDs) remain among the leading causes of morbidity and mortality worldwide, posing significant challenges to healthcare systems. Electronic Health Records (EHRs) store vast amounts of structured and unstructured clinical data, with critical and detailed information often found in free-text clinical notes. Extracting knowledge from these notes is essential to improve patient care and support clinical research. However, clinical notes present substantial challenges, including considerable document length, highly specialized terminology, ambiguities, and multilingual variation. This dissertation addresses these challenges through the development of a modular pipeline — LLaMIC (LLaMA Models applied to MIMIC) — designed for entity recognition and relation extraction in clinical notes. The work is organized into two phases. In the first phase, the LLaMIC pipeline is presented, integrating the large-scale language model LLaMA for entity detection, linking these entities to standardized terminologies (International Classification of Diseases and MeSH), and extracting relations, with an emphasis on therapeutic associations. In the second phase, the creation of three supervised corpora for CCDs, therapeutic drugs (CCDt), and their respective relations is described, combining automatic annotation with non-expert manual correction, as well as the implementation of the LLaMIC pipeline on these corpora. Evaluation of LLaMIC demonstrates substantial improvements over baseline models. For entity recognition in a lenient mode, the best LLaMIC model achieved a precision of 0.887 for CCDs, surpassing BENT by 32% and closely matching BENT’s performance for CCDt (2 percentage points lower). Relation extraction outperformed BioLinkBERT by 4%. The annotated corpora and optimized models are publicly available.
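The abstract reports entity-recognition precision "in a lenient mode" without defining it. The sketch below shows one common reading, assuming that a predicted span counts as correct when it overlaps a gold span of the same type, whereas strict matching requires identical boundaries. The `Span` representation, the function names, and the example offsets are hypothetical and are not taken from the dissertation, whose exact matching criterion may differ.

```python
from typing import List, Tuple

# Hypothetical span representation: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True when two spans carry the same label and share at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def precision(predicted: List[Span], gold: List[Span], lenient: bool) -> float:
    """Fraction of predicted spans that match some gold span."""
    if not predicted:
        return 0.0
    if lenient:
        # Assumed lenient criterion: any overlap with a same-label gold span.
        hits = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    else:
        # Strict criterion: boundaries and label must be identical.
        gold_set = set(gold)
        hits = sum(p in gold_set for p in predicted)
    return hits / len(predicted)


# Illustrative offsets only; labels follow the abstract's CCD / CCDt naming.
gold = [(0, 19, "CCD"), (30, 37, "CCDt")]
pred = [(6, 19, "CCD"), (30, 37, "CCDt"), (50, 55, "CCD")]

strict_p = precision(pred, gold, lenient=False)
lenient_p = precision(pred, gold, lenient=True)
print(f"strict={strict_p:.3f} lenient={lenient_p:.3f}")
```

In this toy case the first prediction captures only part of the gold mention, so it counts under the lenient criterion but not under the strict one. This is why lenient scores are typically at least as high as strict scores.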
Description
Master's thesis, Bioinformatics and Computational Biology, 2025, Universidade de Lisboa, Faculdade de Ciências
Keywords
Cardiovascular and cerebrovascular diseases; Electronic Health Records; Named Entity Recognition; Relation Extraction; Large Language Models
