Lucene for Approximate Nearest-Neighbors Search on Arbitrary Dense Vectors

by Tommaso Teofili, Jimmy Lin

Released as an article.

2019  

Abstract

We demonstrate three approaches for adapting the open-source Lucene search library to perform approximate nearest-neighbor search on arbitrary dense vectors, using similarity search on word embeddings as a case study. At its core, Lucene is built around inverted indexes of a document collection's (sparse) term-document matrix, which is incompatible with the lower-dimensional dense vectors that are common in deep learning applications. We evaluate three techniques to overcome these challenges that can all be natively integrated into Lucene: the creation of documents populated with fake words, LSH applied to lexical realizations of dense vectors, and k-d trees coupled with dimensionality reduction. Experiments show that the "fake words" approach represents the best balance between effectiveness and efficiency. These techniques are integrated into the Anserini open-source toolkit and made available to the community.
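As a rough illustration of the "fake words" idea named in the abstract, the sketch below turns a dense vector into a bag of repeated placeholder tokens that an inverted index could store like ordinary text. The abstract does not spell out the encoding, so everything here is an assumption made for illustration: the quantization factor `q`, the token names `f0`, `f1`, ..., the choice to drop negative components, and the `overlap_score` helper that stands in for index scoring. It is not the authors' or Anserini's implementation.

```python
import math
from collections import Counter


def fake_words(vector, q=10):
    """Encode a dense vector as a list of repeated placeholder tokens.

    Hypothetical scheme: dimension i becomes the token "f<i>", repeated
    floor(q * value) times, so larger components get higher term frequency.
    """
    tokens = []
    for i, value in enumerate(vector):
        # Assumption: non-positive components contribute no tokens.
        tf = math.floor(q * value) if value > 0 else 0
        tokens.extend([f"f{i}"] * tf)
    return tokens


def overlap_score(doc_tokens, query_tokens):
    """Dot product of term-frequency vectors (a simple stand-in for index scoring)."""
    doc_tf = Counter(doc_tokens)
    query_tf = Counter(query_tokens)
    return sum(doc_tf[t] * query_tf[t] for t in query_tf)


doc = fake_words([0.25, 0.0, 0.5])
query = fake_words([0.5, 0.5, 0.0])
score = overlap_score(doc, query)
```

Under these assumptions, the resulting token list could be indexed as the text of a document, and a query vector encoded the same way would match on shared tokens. A coarser `q` gives shorter documents at the cost of precision.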

Archived Files and Locations

application/pdf  91.2 kB
arxiv.org (repository)
web.archive.org (webarchive)
Type: article
Stage: submitted
Date: 2019-10-22
Version: v1
Language: en
arXiv: 1910.10208v1