Mataloto, Diogo Miguel Gomes
2026-02-12
2026-02-12
2025
http://hdl.handle.net/10400.5/117053
Master's thesis, Bioinformatics and Computational Biology, 2025, Universidade de Lisboa, Faculdade de Ciências
Cardiovascular and cerebrovascular diseases (CCDs) remain among the leading causes of morbidity and mortality worldwide, posing significant challenges to healthcare systems. Electronic Health Records (EHRs) store vast amounts of structured and unstructured clinical data, and critical, detailed information is often found only in free-text clinical notes. Extracting knowledge from these notes is essential to improve patient care and support clinical research. However, clinical notes present substantial challenges, including considerable document length, highly specialized terminology, ambiguities, and multilingual variation. This dissertation addresses these challenges through the development of a modular pipeline, LLaMIC (LLaMA Models applied to MIMIC), designed for entity recognition and relation extraction in clinical notes. The work is organized into two phases. The first phase presents the LLaMIC pipeline, which uses the large language model LLaMA to detect entities, links these entities to standardized terminologies (International Classification of Diseases and MeSH), and extracts relations, with an emphasis on therapeutic associations. The second phase describes the creation of three supervised corpora, covering CCDs, therapeutic drugs (CCDt), and their respective relations, built by combining automatic annotation with non-expert manual correction, and the application of the LLaMIC pipeline to these corpora. Evaluation of LLaMIC demonstrates substantial improvements over baseline models. For entity recognition in a lenient mode, the best LLaMIC model achieved a precision of 0.887 for CCDs, surpassing BENT by 32% and closely matching BENT’s performance for CCDt (2 percentage points lower). Relation extraction outperformed BioLinkBERT by 4%.
The annotated corpora and optimized models are publicly available.
application/pdf
eng
Cardiovascular and Cerebrovascular diseases
Electronic Health Records
Named Entity Recognition
Relation Extraction
Large Language Models
Leveraging LLMs for cardiovascular disease-oriented entity recognition and relation extraction in electronic health records
master thesis
204173302