| Name | Description | Size | Format |
|---|---|---|---|
| | | 1.28 MB | Adobe PDF |
Authors
Abstract
Machine Learning is becoming ubiquitous, with its techniques finding usage in every
part of society. We are now witnessing an explosion in ML-based tools, such as the
popular ChatGPT, made possible by advances in hardware that enable large-scale data
processing.
Most importantly, the rise of Machine Learning is related to the release of multiple
frameworks and libraries that abstract its complexities, thus increasing its accessibility.
These tools are used to implement the pipelines that automate the necessary workflow to
create an ML model, from data preprocessing to model learning and evaluation.
However, these pipelines can contain domain-specific defects that are not easy to
find by looking at the code. These defects are caused by flawed methodologies related
to the semantics of pipeline components, data or other concepts specific to data science.
An example of such a defect is the incorrect handling of time-series data when building
datasets, such as shuffling time-series instances before the train/test split. Semantic
defects are difficult to detect and prevent, so they reach production silently and cause
training-serving skew. Unfortunately, unlike in typical software development, pipeline
testing is not feasible, which forces us to explore alternatives.
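The shuffle-before-split defect mentioned above can be illustrated with a minimal sketch. It uses only the Python standard library, and the helper names (`chronological_split`, `shuffled_split`, `leaks_future`) and the toy rows are assumptions made for illustration, not code from the thesis.

```python
import random

def chronological_split(rows, test_fraction=0.25):
    """Split time-ordered rows so every test row is later than every training row."""
    ordered = sorted(rows, key=lambda r: r["t"])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]

def shuffled_split(rows, test_fraction=0.25, seed=0):
    """Defective variant: shuffling first can place later rows in the training set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def leaks_future(train, test):
    """True if any training row is later than the earliest test row."""
    return max(r["t"] for r in train) > min(r["t"] for r in test)

# Hypothetical toy data: one observation per time step.
rows = [{"t": t, "y": 2 * t} for t in range(8)]

train, test = chronological_split(rows)
print("chronological split leaks:", leaks_future(train, test))

train, test = shuffled_split(rows)
print("shuffled split leaks:", leaks_future(train, test))
```

Both functions run without error and return splits of the same sizes, which is why this kind of defect is hard to spot by reading the code or by checking that the pipeline executes.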
With a focus on supervised machine learning, this work identified relevant semantic
defects by drawing on the community of ML developers and data scientists, and on the
academic and grey literature. To tackle the defects, we developed a domain-specific
language (DSL) capable of describing pipeline structure and the properties of its
components and data sources.
We also created a static analyser to automate defect detection in pipelines specified using
the DSL. The verification process relies on the formal specification of pipeline components.
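As a rough illustration of the idea of checking a pipeline description without running it, the sketch below models a pipeline as a declared data source plus a list of steps and flags one defect. The `DataSource` and `Step` classes, the operation names, and the single rule are hypothetical. They do not reproduce the DSL, the analyser, or the formal specifications developed in the thesis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataSource:
    name: str
    is_time_series: bool  # declared property of the data, not inferred

@dataclass(frozen=True)
class Step:
    op: str  # hypothetical operation names, e.g. "shuffle", "split", "fit"

def check_pipeline(source, steps):
    """Return defect messages found by inspecting the description only."""
    defects = []
    shuffled_at = None
    for index, step in enumerate(steps):
        if step.op == "shuffle" and shuffled_at is None:
            shuffled_at = index
        elif step.op == "split":
            # Rule: time-series data must not be shuffled before it is split.
            if source.is_time_series and shuffled_at is not None:
                defects.append(
                    f"step {index}: split follows shuffle at step {shuffled_at} "
                    f"on time-series source '{source.name}'"
                )
            break  # only the first split matters for this rule
    return defects

prices = DataSource("prices", is_time_series=True)
defective = [Step("shuffle"), Step("split"), Step("fit")]
sound = [Step("split"), Step("fit")]

print(check_pipeline(prices, defective))
print(check_pipeline(prices, sound))
```

The check depends on the declared `is_time_series` property rather than on the data itself, which mirrors the general point that such defects concern the semantics of components and data sources.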
We modelled pipelines containing the relevant defects we identified to evaluate the
solution. The solution successfully detected all the defects present in the pipelines.
Description
Master's thesis, Informatics Engineering, 2024, Universidade de Lisboa, Faculdade de Ciências
Keywords
Static Verification; Domain-Specific Language; Machine Learning; Pipeline; Formal Specification; Master's theses - 2024
