Vedor, João Pedro Ramos
2026-02-09
2026-02-09
2025
http://hdl.handle.net/10400.5/116943
Master's thesis, Informatics Engineering, 2025, Universidade de Lisboa, Faculdade de Ciências
Imagine reading a clinical note and encountering the term “ALS.” Is it amyotrophic lateral sclerosis (a disease), advanced life support (a procedure), or even a gene symbol? Traditional Named Entity Recognition (NER) can identify that “ALS” is a mention of a biomedical entity, but it does not disambiguate which specific concept the mention refers to. Named Entity Linking (NEL) is required to map the mention to a unique identifier in an ontology such as UMLS or MeSH, ensuring that downstream systems understand precisely which disease, procedure, or gene is referenced. For example, correctly linking “ALS” to the UMLS concept for amyotrophic lateral sclerosis allows consistent retrieval, reasoning, and integration across biomedical datasets. Scaling NEL across articles, clinical narratives, and curated resources is challenging due to the size of ontologies, acronym collisions, synonyms, and ambiguous mentions (e.g., “Lou Gehrig’s disease” vs. “ALS”). Motivated by these challenges, this thesis introduces XMR4EL, a modular and reproducible framework that treats NEL as eXtreme Multi-label Ranking (XMR): “organize → route → rank.” XMR4EL decouples semantic indexing, hierarchical routing, and label-level ranking behind stable interfaces, enabling plug-and-play components while preserving deterministic preprocessing, sparse-first modeling, and persisted artifacts for reproducibility. On automatically labeled disease corpora (Inst-100/Inst-500 for training; BC5CDR for testing), a 4-layer hierarchy achieves 61.7% Hit@100 at 195 ms/mention with a beam width of 40, revealing a clear trade-off point between speed and quality near beam widths of 30–40. Increasing the number of synonym instances per label from 100 to 500 yields +8–13 Hit@100 points at practical beam widths, and added depth improves recall at matched latency by shrinking leaf scopes.
Conclusion: XMR4EL demonstrates that an open, sparse-first XMR design is practical today. By combining effective mention detection (NER) with accurate concept grounding (NEL), it provides a reliable pipeline for linking disease mentions to their unique IDs, supporting high-quality biomedical information retrieval and integration. Opportunities for further improvements include calibration, document-level coherence, and label-side text encoders to boost top-1/5 accuracy.
application/pdf
eng
Biomedical Entity Linking; Extreme Multi-label Ranking; Candidate Generation; Beam Search; Hard Negative Mining
Semantic Indexing of Descriptors, Partitioning and Classification for Mapping Biomedical Entities
master thesis
204177677