Repository logo
 
No Thumbnail Available
Publication

Large-Scale Distributed Similarity Search with Locality-Sensitive Hashing

Use this identifier to reference this record.
Name:Description:Size:Format: 
TM_João_Queimado.pdf3.15 MBAdobe PDF Download

Abstract(s)

Locality sensitive hashing (LSH) is an efficient hash-based solution for the Approximate Nearest Neighbor problem in high-dimensional environments. The recent popularity growth of the Internet brought its platforms an increase in online users. The user-base growth of Internet platforms increases the amount of user generated information produced by these platforms. This type of information accumulates into datasets that store large quantities of multidimensional objects, i.e., objects defined by several features. To keep search latencies as low as possible, faster and more efficient search methods need to be developed to counteract the big multidimensional datasets performance issues. LSH archives fast search times by indexing similar data objects close to each other but incurs extra computational steps and additional data to be stored. The high storage demand of traditional LSH algorithms in big-data environments vouch for the development of distributed and resource-efficient LSH solutions. In this thesis, we propose a distributed LSH approach that focuses on balancing storage resources with low storage overhead and without compromising LSH’s search efficiency. The proposed solution uses erasure codes to store object information directly into the LSH’s index nodes removing the need for additional storage structures and providing fault-tolerance. To further increase the proposed solution’s performance, we also include a process optimization that reduces query latency by directly comparing LSH hash values instead of object values. Additionally, we also develop an open source generalized framework capable of implementing multiple distributed LSH approaches using several LSH hashing algorithms and storage backends. We evaluated the proposed model and its optimized variant and proved not only that the proposed storage solution balances storage resources through the system nodes, but that also scales with the size of objects. Additionally, this evaluation also shows that the proposed optimization significantly increases the solutions performance, at some service quality cost.

Description

Tese de mestrado, Engenharia Informática , 2023, Universidade de Lisboa, Faculdade de Ciências

Keywords

Locality Sensitive Hashing (LSH) Cliente-Servidor, Big-data Approximate Nearest Neighbour (ANN) Erasure codes Teses de mestrado - 2024

Pedagogical Context

Citation

Research Projects

Organizational Units

Journal Issue

Publisher

CC License