Name: | Description: | Size: | Format: | |
---|---|---|---|---|
3.15 MB | Adobe PDF |
Authors
Abstract(s)
Locality sensitive hashing (LSH) is an efficient hash-based solution for the Approximate Nearest Neighbor problem in high-dimensional environments. The recent popularity growth of the Internet brought its platforms an increase in online users. The user-base
growth of Internet platforms increases the amount of user generated information produced by these platforms. This type of information accumulates into datasets that store
large quantities of multidimensional objects, i.e., objects defined by several features. To
keep search latencies as low as possible, faster and more efficient search methods need to
be developed to counteract the big multidimensional datasets performance issues. LSH
archives fast search times by indexing similar data objects close to each other but incurs
extra computational steps and additional data to be stored. The high storage demand of
traditional LSH algorithms in big-data environments vouch for the development of distributed and resource-efficient LSH solutions. In this thesis, we propose a distributed
LSH approach that focuses on balancing storage resources with low storage overhead and
without compromising LSH’s search efficiency. The proposed solution uses erasure codes
to store object information directly into the LSH’s index nodes removing the need for additional storage structures and providing fault-tolerance. To further increase the proposed
solution’s performance, we also include a process optimization that reduces query latency
by directly comparing LSH hash values instead of object values. Additionally, we also
develop an open source generalized framework capable of implementing multiple distributed LSH approaches using several LSH hashing algorithms and storage backends.
We evaluated the proposed model and its optimized variant and proved not only that the
proposed storage solution balances storage resources through the system nodes, but that
also scales with the size of objects. Additionally, this evaluation also shows that the
proposed optimization significantly increases the solutions performance, at some service
quality cost.
Description
Tese de mestrado, Engenharia Informática , 2023, Universidade de Lisboa, Faculdade de Ciências
Keywords
Locality Sensitive Hashing (LSH) Cliente-Servidor, Big-data Approximate Nearest Neighbour (ANN) Erasure codes Teses de mestrado - 2024