Publication

Semantic Indexing of Descriptors, Partitioning and Classification for Mapping Biomedical Entities

dc.contributor.author: Vedor, João Pedro Ramos
dc.contributor.institution: Faculty of Sciences
dc.contributor.institution: Department of Informatics
dc.contributor.supervisor: Couto, Francisco José Moreira
dc.date.accessioned: 2026-02-09T17:55:02Z
dc.date.available: 2026-02-09T17:55:02Z
dc.date.issued: 2025
dc.description: Master's thesis, Informatics Engineering, 2025, Universidade de Lisboa, Faculdade de Ciências
dc.description.abstract: Imagine reading a clinical note and encountering the term “ALS.” Is it amyotrophic lateral sclerosis (a disease), advanced life support (a procedure), or even a gene symbol? Traditional Named Entity Recognition (NER) can identify that “ALS” is a mention of a biomedical entity, but it does not disambiguate which specific concept it refers to. Named Entity Linking (NEL) is required to map the mention to a unique identifier in an ontology such as UMLS or MeSH, ensuring that downstream systems understand precisely which disease, procedure, or gene is referenced. For example, correctly linking “ALS” to the UMLS concept for amyotrophic lateral sclerosis allows consistent retrieval, reasoning, and integration across biomedical datasets. Scaling NEL across articles, clinical narratives, and curated resources is challenging due to the size of ontologies, acronym collisions, synonyms, and ambiguous mentions (e.g., “Lou Gehrig’s disease” vs. “ALS”). Motivated by these challenges, this thesis introduces XMR4EL, a modular and reproducible framework that treats NEL as eXtreme Multi-label Ranking (XMR): “organize → route → rank.” XMR4EL decouples semantic indexing, hierarchical routing, and label-level ranking behind stable interfaces, enabling plug-and-play components while preserving deterministic preprocessing, sparse-first modeling, and persisted artifacts for reproducibility. On automatically labeled disease corpora (Inst-100/Inst-500 for training; BC5CDR for testing), a 4-layer hierarchy achieves 61.7% Hit@100 at 195 ms/mention with beam=40, revealing a clear trade-off point between speed and quality near beam widths 30–40. Increasing per-label synonyms from 100 to 500 instances yields +8–13 Hit@100 points at practical beams, and added depth improves recall at matched latency by shrinking leaf scopes. Conclusion: XMR4EL demonstrates that an open, sparse-first XMR design is practical today. By combining effective mention detection (NER) with accurate concept grounding (NEL), it provides a reliable pipeline for linking disease mentions to their unique IDs, supporting high-quality biomedical information retrieval and integration. Opportunities for further improvements include calibration, document-level coherence, and label-side text encoders to boost top-1/5 accuracy. [en]
dc.format: application/pdf
dc.identifier.tid: 204177677
dc.identifier.uri: http://hdl.handle.net/10400.5/116943
dc.language.iso: eng
dc.subject: Biomedical Entity Linking
dc.subject: Extreme Multi-label Ranking
dc.subject: Candidate Generation
dc.subject: Beam Search
dc.subject: Hard Negative Mining
dc.title: Semantic Indexing of Descriptors, Partitioning and Classification for Mapping Biomedical Entities [en]
dc.type: master thesis
dspace.entity.type: Publication
rcaap.rights: openAccess

Files

Main
Name: TM_Joao_Vedor.pdf
Size: 568.57 KB
Format: Adobe Portable Document Format