COSMIC: A Framework for taking the census of star clusters in the Milky Way

Dias, Ariana Ferreira

Publication

2025, Master's dissertation

dc.contributor.author	Dias, Ariana Ferreira
dc.contributor.institution	Faculty of Sciences
dc.contributor.institution	Department of Informatics
dc.contributor.supervisor	Barros, Márcia Cristina Afonso
dc.contributor.supervisor	Almeida, André Maria da Silva Dias Moitinho de
dc.date.accessioned	2026-01-20T10:50:01Z
dc.date.available	2026-01-20T10:50:01Z
dc.date.issued	2025
dc.description	Master's thesis, Informatics, 2025, Universidade de Lisboa, Faculdade de Ciências
dc.description.abstract	The Gaia mission of the European Space Agency (ESA) provides astrometry and photometry for approximately two billion stars, and interest in open star clusters has grown significantly as a result. Precise Gaia data, advances in machine learning, increased computational power, and open-source software have led to a surge in cluster discoveries. These are typically published in separate catalogues with varying levels of crossmatching rigour, and they frequently duplicate previously reported clusters. To address this challenge, we developed a framework to integrate, clean, and crossmatch multiple catalogues, producing a final compiled catalogue based on cluster memberships rather than solely on centre coordinates and radii. The system uses three interlinked databases: raw data storage, a data warehouse with cleaned and normalised data, and a final compiled catalogue. Data extraction uses the NASA/ADS and CDS/VizieR APIs, and Gaia Archive queries validate member IDs, recovering approximately 97% of stars via cone searches when Gaia IDs are missing. Crossmatching is guided by similarity metrics (Jaccard, Dice, Overlap), a clustering quality measure (silhouette score), and distribution tests (Kolmogorov-Smirnov test and Jensen-Shannon divergence) for parallax, proper motions, magnitude, and BP-RP colour. A baseline dataset of manually labelled cluster pairs was used to train supervised machine learning models, including logistic regression, random forest, XGBoost, and SVM, with XGBoost performing best. The framework reduced 25,577 initial clusters to 12,310 unique clusters, of which 48.1% are new and 51.9% previously known, totalling 3,456,379 members. A web application allows querying the final catalogue and accessing original catalogue data.
Future improvements include integrating additional surveys, exploring alternative machine learning models, optimising the Extraction, Transformation and Loading process, refining cluster merging rules, improving the determination of cluster membership probabilities, and enhancing the web application's functionality. Overall, the framework provides a scalable and robust pipeline for consolidating open cluster catalogues, producing a curated dataset essential for galactic structure and evolution studies.	en
dc.format	application/pdf
dc.identifier.tid	204176140
dc.identifier.uri	http://hdl.handle.net/10400.5/116725
dc.language.iso	eng
dc.subject	Open Clusters
dc.subject	ETL Process
dc.subject	Crossmatching
dc.subject	Machine Learning
dc.subject	Framework
dc.title	COSMIC: A Framework for taking the census of star clusters in the Milky Way	en
dc.type	master thesis
dspace.entity.type	Publication
rcaap.rights	openAccess
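
The abstract describes crossmatching by cluster membership using the Jaccard, Dice and Overlap similarity metrics. As a minimal sketch of those three metrics, assuming each cluster is represented as a set of Gaia source IDs (the function names and the example IDs below are hypothetical and are not taken from the thesis):

```python
def jaccard(a: set, b: set) -> float:
    """Shared members divided by the size of the union of both clusters."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """Twice the shared members divided by the summed sizes of both clusters."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a: set, b: set) -> float:
    """Shared members divided by the size of the smaller cluster."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical member lists for the same region from two catalogues.
cluster_a = {1, 2, 3, 4}
cluster_b = {3, 4, 5, 6, 7, 8}

scores = {
    "jaccard": jaccard(cluster_a, cluster_b),
    "dice": dice(cluster_a, cluster_b),
    "overlap": overlap(cluster_a, cluster_b),
}
print(scores)
```

The Overlap coefficient is normalised by the smaller set, so it stays high when a small cluster is largely contained in a larger one, whereas Jaccard and Dice penalise the size mismatch. This may be one reason to consider the three metrics together rather than any single one.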

Files

Main

Name: TM_Ariana_Dias.pdf
Size: 21.93 MB
Format: Adobe Portable Document Format

Collections

Pure > Dspace
PURE > Dspace - Faculdade de Ciências