COSMIC: A Framework for taking the census of star clusters in the Milky Way

Dias, Ariana Ferreira

http://hdl.handle.net/10400.5/116725

Use this identifier to reference this record.

Name:	Description:	Size:	Format:
TM_Ariana_Dias.pdf		21.93 MB	Adobe PDF

Authors

Dias, Ariana Ferreira

Abstract

With the wealth of data provided by the Gaia mission of the European Space Agency (ESA), including astrometry and photometry for approximately two billion stars, interest in open star clusters has grown significantly and the field has become a highly relevant research topic. The availability of precise Gaia data, advances in machine learning, increased computational power, and open-source software have led to a surge in cluster discoveries. These discoveries are often published in separate catalogues with varying levels of crossmatching rigour, and they frequently duplicate previously reported clusters. To address this challenge, we developed a framework to integrate, clean, and crossmatch multiple catalogues, producing a final compiled catalogue based on cluster memberships rather than solely on centre coordinates and radii. The system uses three interlinked databases: raw data storage, a data warehouse with cleaned and normalised data, and a final compiled catalogue. Data extraction uses the NASA/ADS and CDS/VizieR APIs, and Gaia Archive queries validate member IDs, recovering approximately 97% of stars via cone searches when Gaia IDs are missing. Crossmatching is guided by similarity metrics (Jaccard, Dice, Overlap), a clustering quality measure (the silhouette score), and distribution tests (the Kolmogorov–Smirnov test and the Jensen–Shannon divergence) for parallax, proper motions, magnitude, and BP-RP colour. A baseline dataset of manually labelled cluster pairs was used to train supervised machine learning models, including logistic regression, random forest, XGBoost, and SVM, with XGBoost performing best. The framework reduced 25,577 initial clusters to 12,310 unique clusters, of which 48.1% are new and 51.9% previously known, totalling 3,456,379 members. A web application allows querying the final catalogue and accessing the original catalogue data.
Future improvements include integrating additional surveys, exploring alternative machine learning models, optimising the Extraction, Transformation and Loading (ETL) process, refining cluster merging rules, improving the determination of cluster membership probabilities, and extending the web application's functionality. Overall, the framework provides a scalable and robust pipeline for consolidating open cluster catalogues, producing a curated dataset essential for studies of galactic structure and evolution.
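
The three membership-based similarity metrics named in the abstract (Jaccard, Dice, Overlap) can be illustrated with a minimal sketch. It assumes each cluster is represented as a set of Gaia source IDs; the function names and the example IDs are hypothetical and are not taken from the thesis, whose actual implementation may differ.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: shared members divided by all distinct members."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """Dice coefficient: twice the shared members over the summed sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a: set, b: set) -> float:
    """Overlap coefficient: shared members over the smaller cluster's size."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical member lists (placeholder integers standing in for Gaia source IDs).
cluster_a = {1, 2, 3, 4}
cluster_b = {3, 4, 5, 6, 7, 8}

scores = {
    "jaccard": jaccard(cluster_a, cluster_b),
    "dice": dice(cluster_a, cluster_b),
    "overlap": overlap(cluster_a, cluster_b),
}
print(scores)
```

The Overlap coefficient stays high when a small cluster is largely contained in a larger one, whereas Jaccard penalises the size difference. This is one plausible reason to consider the metrics jointly when deciding whether two catalogue entries describe the same cluster.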

Description

Master's thesis, Informatics, 2025, Universidade de Lisboa, Faculdade de Ciências

Keywords

Open Clusters; ETL Process; Crossmatching; Machine Learning; Framework

URI

http://hdl.handle.net/10400.5/116725

Collections

Pure > Dspace
PURE > Dspace - Faculdade de Ciências
