Name: | Description: | Size: | Format: |
---|---|---|---|
 | | 1.33 MB | Adobe PDF |
Advisor(s)
Abstract(s)
The goal of this work is to improve a name-matching algorithm, using the Levenshtein Distance as its foundation, in order to overcome the algorithm's current limitations. Used in isolation, the Levenshtein Distance proved highly ineffective, yielding a false-positive rate of 99%. To address this, it was combined with three metrics: Jaro-Winkler, N-Gram, and Cosine Similarity. An additional experiment with the Soft TF-IDF technique was carried out for comparison with the preceding methods.
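For reference, the baseline metric can be sketched in Python as the standard dynamic-programming formulation of the Levenshtein Distance; this is a generic illustration, not the implementation used in the thesis:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string
    prev = list(range(len(b) + 1))  # distances from the empty prefix
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]
```

For name matching, the raw distance is typically normalized by the longer string's length so that dissimilar short names and dissimilar long names score comparably.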
The best combination was Levenshtein Distance with N-Gram, which proved the most effective at detecting matches under the parameters used. The combination with Jaro-Winkler achieved a false-positive rate of 14% but struggled with names given in different orders. Cosine Similarity produced a rate similar to N-Gram's (25%), although the weights assigned to the functions had to change. Soft TF-IDF was effective at identifying similarity but reached a false-positive rate of 45%, making it the least efficient option.
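The winning approach blends Levenshtein with character N-Gram similarity. A minimal sketch of such a weighted combination follows; the helper names, the bigram size, and the 0.5/0.5 weights are illustrative assumptions, not the parameters tuned in the work:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via iterative dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Distance normalized to a 0..1 similarity."""
    m = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / m if m else 1.0

def ngrams(s: str, n: int = 2) -> set:
    padded = f" {s.lower()} "  # pad so word boundaries contribute grams
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of character n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def combined_score(a: str, b: str,
                   w_lev: float = 0.5, w_ngram: float = 0.5) -> float:
    """Weighted blend of the two similarities; weights are illustrative."""
    return w_lev * levenshtein_similarity(a, b) + w_ngram * ngram_similarity(a, b)
```

Because n-gram sets are order-insensitive at the word level, this blend degrades more gracefully on reordered names than edit distance alone, which is consistent with the behavior the abstract reports.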
Because the initial dataset was small, a further test was run on a larger set, where the chosen algorithm achieved a false-positive rate of 13% with a processing time of 49 minutes, confirming its robustness and scalability.
Overall, this amounts to a minimum improvement of 75% over the initial false-positive rate, benefiting both the seller and the customer by ensuring an efficient program.
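Reading the stated minimum improvement as a relative reduction of the false-positive rate, a quick check with the rates quoted above (our arithmetic, not the thesis's):

```python
# False-positive rates reported in the abstract
baseline_fp = 0.99  # Levenshtein Distance alone
cosine_fp = 0.25    # worst of the combined metrics on the small dataset
final_fp = 0.13     # chosen combination on the larger dataset

# Relative reduction: (old - new) / old
worst_case_improvement = (baseline_fp - cosine_fp) / baseline_fp  # ~0.747
final_improvement = (baseline_fp - final_fp) / baseline_fp        # ~0.869
```

The weakest combination lands at roughly the stated 75% minimum, and the chosen algorithm exceeds it comfortably.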
Description
Master's project work, Data Science, 2025, Universidade de Lisboa, Faculdade de Ciências
Keywords
Name Matching; Levenshtein Distance; N-Gram; False positives; Master's project works - 2025