Research project
Resource Aware Programming
Funder
Authors
Publications
Optimization of feature learning through grammar-guided genetic programming
Publication. Ingelse, Leon Kornelis; Fonseca, Alcides Aguiar
Machine Learning (ML) is becoming more prominent in daily life. A key aspect of ML is Feature Engineering (FE), which can be a long and tedious process. Therefore, the automation of FE, known as
Feature Learning (FL), can be highly rewarding. FL methods should not only achieve high prediction performance but also produce interpretable models. Many current high-performance ML methods
that can be considered FL methods, such as Neural Networks and PCA, lack interpretability.
A popular ML method used for FL that produces interpretable models is Genetic Programming (GP), with
multiple successful applications and methods such as M3GP. In this thesis, I present two new GP-based FL
methods, namely M3GP with Domain Knowledge (DK-M3GP) and DK-M3GP with feature Aggregation
(DKA-M3GP). Both use grammars to enhance the search process of GP, an approach called Grammar-Guided GP (GGGP). DK-M3GP uses grammars to incorporate domain knowledge into the search process.
In particular, I use DK-M3GP to define which solutions are valid from a human perspective, in this case by disallowing
arithmetic operations on categorical features. For example, multiplying the postal code of an
individual by their wage is not deemed sensible and is thus disallowed.
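As a rough illustration of this kind of restriction, the following Python sketch encodes a tiny typed grammar in which arithmetic nodes accept only numeric operands. It is not the thesis implementation and does not use the Genetic Engine API; the class names, the feature names (`wage`, `age`, `postal_code`) and the `IfCat` production are assumptions made only for this example.

```python
import random
from dataclasses import dataclass
from typing import Union

# Hypothetical feature names, used only for illustration.
NUMERIC = ["wage", "age"]
CATEGORICAL = {"postal_code": ["1000", "2000"]}


@dataclass(frozen=True)
class NumVar:
    """Leaf that reads a numeric feature."""
    name: str


@dataclass(frozen=True)
class BinOp:
    """Arithmetic node: both operands must be numeric expressions."""
    op: str  # "+" or "*"
    left: "NumExpr"
    right: "NumExpr"


@dataclass(frozen=True)
class IfCat:
    """The only production that reads a categorical feature: it compares the
    feature to a category and selects one of two numeric branches."""
    feature: str
    category: str
    then: "NumExpr"
    otherwise: "NumExpr"


NumExpr = Union[NumVar, BinOp, IfCat]


def random_expr(rng: random.Random, depth: int) -> NumExpr:
    """Sample a tree; categorical features never become arithmetic operands."""
    if depth == 0:
        return NumVar(rng.choice(NUMERIC))
    kind = rng.choice(["var", "binop", "ifcat"])
    if kind == "var":
        return NumVar(rng.choice(NUMERIC))
    if kind == "binop":
        return BinOp(rng.choice("+*"),
                     random_expr(rng, depth - 1),
                     random_expr(rng, depth - 1))
    feature = rng.choice(list(CATEGORICAL))
    return IfCat(feature, rng.choice(CATEGORICAL[feature]),
                 random_expr(rng, depth - 1),
                 random_expr(rng, depth - 1))


def evaluate(expr: NumExpr, row: dict) -> float:
    """Evaluate a tree on one data row (a dict from feature name to value)."""
    if isinstance(expr, NumVar):
        return float(row[expr.name])
    if isinstance(expr, BinOp):
        a, b = evaluate(expr.left, row), evaluate(expr.right, row)
        return a + b if expr.op == "+" else a * b
    branch = expr.then if row[expr.feature] == expr.category else expr.otherwise
    return evaluate(branch, row)
```

Because `BinOp` can only be built from `NumExpr` children, an expression such as "postal code times wage" is simply not in the search space, while the categorical feature can still influence the result through a comparison.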
In DKA-M3GP, I use grammars to include a feature aggregation method in the search space. This
method can be used for time series and panel datasets, to aggregate the target value of historical data based
on a known feature value of a new data point. For example, if I want to predict the number of bikes seen
daily in a city, it is useful to know how many were seen on average in the last week. Furthermore,
DKA-M3GP allows for filtering the aggregation based on some other feature value. For example, I can
include the average number of bikes seen on past Sundays.
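The bike example can be sketched in a few lines of Python. This is only an illustration of the aggregation idea under stated assumptions, not the DKA-M3GP implementation: the function name `mean_past_target`, its parameters and the toy counts are all hypothetical.

```python
from datetime import date, timedelta
from statistics import mean


def mean_past_target(history, day, window_days, weekday=None):
    """Average the target over the `window_days` days strictly before `day`.

    `history` is a list of (date, target) pairs. If `weekday` is given
    (Monday=0 ... Sunday=6), only rows on that weekday are aggregated.
    Returns None when no historical row matches.
    """
    start = day - timedelta(days=window_days)
    values = [
        target
        for d, target in history
        if start <= d < day and (weekday is None or d.weekday() == weekday)
    ]
    return mean(values) if values else None


# Toy data (made up): daily bike counts for 1-14 January 2024.
first_day = date(2024, 1, 1)  # a Monday
history = [(first_day + timedelta(days=i), 100 + 10 * i) for i in range(14)]

new_day = date(2024, 1, 15)
last_week_mean = mean_past_target(history, new_day, window_days=7)
past_sundays_mean = mean_past_target(history, new_day, window_days=14, weekday=6)
```

Only rows dated before the new data point are aggregated, so the resulting feature does not leak the target value being predicted.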
I evaluated my FL methods on two ML problems in two settings. First, I evaluated the FL process on its own and, after that, the FL step within four ML pipelines. On its own,
DK-M3GP shows a two-fold advantage over standard M3GP: better interpretability in general, and higher
prediction performance on one problem. DKA-M3GP has much better prediction performance than
M3GP on one problem, and slightly better performance on the other. Furthermore, within the ML pipelines it
performed well on one of the two problems. Overall, my methods show potential for FL.
Both methods are implemented in Genetic Engine, an individual-representation-independent GGGP
framework created as part of this thesis. Genetic Engine is implemented entirely in Python and shows
performance competitive with the mature GGGP framework PonyGE2.
Organizational units
Description
Keywords
Contributors
Funders
Funding entity
Fundação para a Ciência e a Tecnologia
Funding programme
3599-PPCDT
Award number
EXPL/CCI-COM/1306/2021
