Static Analysis for Detection of Defects in Machine Learning Pipelines

Silva, Pedro Miguel Alcântara da

Publication

2024, Master's dissertation

datacite.subject.fos	Departamento de Informática	pt_PT
dc.contributor.advisor	Fonseca, Alcides Miguel Cachulo Aguiar
dc.contributor.advisor	Lopes, Maria Antónia Bacelar da Costa, 1968-
dc.contributor.author	Silva, Pedro Miguel Alcântara da
dc.date.accessioned	2025-01-17T12:48:25Z
dc.date.available	2025-01-17T12:48:25Z
dc.date.issued	2024
dc.date.submitted	2024
dc.description	Master's thesis, Informatics Engineering, 2024, Universidade de Lisboa, Faculdade de Ciências	pt_PT
dc.description.abstract	Machine Learning is becoming ubiquitous, with its techniques finding use in every part of society. We are now witnessing an explosion in ML-based tools, such as the popular ChatGPT, made possible by advances in hardware that enable large-scale data processing. Most importantly, the rise of Machine Learning is related to the release of multiple frameworks and libraries that abstract its complexities, thus increasing its accessibility. These tools are used to implement the pipelines that automate the workflow needed to create an ML model, from data preprocessing to model learning and evaluation. However, these pipelines can contain domain-specific defects that are not easy to find by inspecting the code. These defects are caused by flawed methodologies related to the semantics of pipeline components, data, or other concepts specific to data science. An example of such a defect is the incorrect handling of time-series data when building datasets, such as shuffling time-series instances before the train/test split. Semantic defects are difficult to detect and prevent; they reach production silently and cause training-serving skew. Unfortunately, unlike in typical software development, testing pipelines is not feasible, which forces us to explore alternatives. With a focus on supervised machine learning, this work identified relevant semantic defects by drawing on the community of ML developers and data scientists and on the academic and grey literature. To tackle the defects, we developed a domain-specific language (DSL) capable of describing pipeline structure and the properties of its components and data sources. We also created a static analyser to automate defect detection in pipelines specified using the DSL. The verification process relies on the formal specification of pipeline components. To evaluate the solution, we modelled pipelines containing the relevant defects we identified. The solution successfully detected all the defects present in the pipelines.	pt_PT
dc.identifier.tid	203875524	pt_PT
dc.identifier.uri	http://hdl.handle.net/10400.5/97300
dc.language.iso	eng	pt_PT
dc.subject	Static Verification	pt_PT
dc.subject	Domain-Specific Language	pt_PT
dc.subject	Machine Learning	pt_PT
dc.subject	Pipeline	pt_PT
dc.subject	Formal Specification	pt_PT
dc.subject	Master's theses - 2024	pt_PT
dc.title	Static Analysis for Detection of Defects in Machine Learning Pipelines	pt_PT
dc.type	master thesis
dspace.entity.type	Publication
rcaap.rights	openAccess	pt_PT
rcaap.type	masterThesis	pt_PT
thesis.degree.name	Master's thesis in Informatics Engineering	pt_PT
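
The time-series example in the abstract (shuffling instances before the train/test split) can be illustrated with a minimal sketch of a static check over a pipeline description. This is not the DSL or the analyser from the thesis: the `Step` class, the `op` labels, and the `find_shuffle_before_split` function are hypothetical names introduced here only to show the idea of detecting a defect from the pipeline's structure without running it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One pipeline component (hypothetical model, not the thesis's DSL)."""
    name: str
    op: str  # assumed labels: "load", "shuffle", "split", "fit"


def find_shuffle_before_split(steps, is_time_series):
    """Return warnings for shuffles that precede a train/test split.

    The check inspects only the pipeline description; nothing is executed.
    """
    if not is_time_series:
        # Shuffling before the split is only flagged for time-series data.
        return []
    warnings = []
    pending_shuffles = []
    for step in steps:
        if step.op == "shuffle":
            pending_shuffles.append(step.name)
        elif step.op == "split":
            for shuffle_name in pending_shuffles:
                warnings.append(
                    f"time-series data is shuffled by '{shuffle_name}' "
                    f"before split '{step.name}'"
                )
            pending_shuffles = []
    return warnings


# A pipeline containing the defect described in the abstract.
pipeline = [
    Step("load_prices", "load"),
    Step("shuffle_rows", "shuffle"),
    Step("train_test_split", "split"),
    Step("fit_model", "fit"),
]

for warning in find_shuffle_before_split(pipeline, is_time_series=True):
    print(warning)
```

The sketch marks the data source as time-series with a single boolean; the abstract suggests the actual DSL describes the properties of components and data sources in more detail, which this example does not attempt to reproduce.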

Files

Main

Name: TM_Pedro_Silva.pdf
Size: 1.28 MB
Format: Adobe Portable Document Format

License

Name: license.txt
Size: 1.2 KB
Format: Item-specific license agreed upon to submission

Collections

FC-DI - Master Thesis (dissertation)