| Name | Description | Size | Format |
|---|---|---|---|
| | | 1.12 MB | Adobe PDF |
Authors
Advisor(s)
Abstract(s)
The rapid growth of biomedical literature makes it challenging for researchers to stay up to date. Text mining has become essential for efficiently extracting knowledge from unstructured texts. Abstracts offer a focused alternative to full-text articles, but extracting meaningful insights from them remains difficult. Key tasks such as Named Entity Recognition (NER) and Named Entity Linking (NEL) face issues like ambiguous terminology, entity variability, and incomplete knowledge bases, especially when handling novel or NIL (not-in-lexicon) entities. Relation Extraction (RE) systems face challenges of their own, including limited scope, lack of interpretability, and a focus on binary relations that do not fully capture complex biomedical interactions. This thesis introduces a small gold-standard dataset created by expanding 31 abstracts from the 600-document BioRED corpus. The dataset adds CellTypeOrAnatomicalConcept and NIL entities, serving as a resource to test and improve the Biomedical Entity Annotator (BENT) tool for NER and NEL. It also enables the extension of relation extraction from binary to n-ary relations, starting with ternary relations. Compared to BioRED, NER performance on this dataset was generally lower across most entity types, while NEL showed particularly low scores for GeneOrGeneProduct, CellTypeOrAnatomicalConcept, and NIL entities, reflecting the challenges of novel entity annotation. For n-ary relation extraction, the K-RET system, built on BERT-based models, was employed with SciBERT and BioMedBERT. In the binary setting, the system achieved an F1-score of 0.775 compared to BioRED’s 0.7562. Ternary relations were evaluated against BioRex, a state-of-the-art study, yielding F1-scores of approximately 0.65. Although these scores are lower than those of BioRex, they provide a promising baseline for n-ary relation extraction across a broader set of entity types.
Description
Master's thesis, Biochemistry and Biomedicine, 2025, Universidade de Lisboa, Faculdade de Ciências
Keywords
Machine Learning Methods; Natural Language Processing; Relation Extraction; Text Mining; Biomedical Literature
