F.2. High-dimensional indexing

In high-dimensional space, conventional indexes that rely on bounding boxes fail. Eventually, even locality sensitive hashing fails. But let’s first experiment with locality sensitive hashing to show its limitations. Then you will learn how to get around those limitations by giving up on the idea of a perfect index. You will create an approximate index after an experiment with locality sensitive hashing.

F.2.1. Locality sensitive hashing

In figure F.1, we constructed 400,000 completely random vectors, each with 200 dimensions (typical for topic vectors for a large corpus). And we indexed them with the Python LSHash package (pip install lshash3). Now imagine that you have a search engine that wants to find all the topic vectors that are close to a query topic vector. How many will be gathered up by the locality sensitive hash? And for what number of dimensions for the topic vectors do your search results cease to make much sense at all?

Figure F.1. Semantic search with LSHash

You can’t get many search results correct once the number of dimensions gets significantly above 10 or so. If you’d like to play with this yourself, or try your hand at building a better LSH algorithm, the code for running experiments like this is available in the nlpia package. And the lshash3 package is open source, with only about 100 lines of code at the heart of it.

F.2.2. Approximate nearest neighbors

Approximate nearest neighbor search is the latest answer to the high-dimensional vector space problem. The approximate hashes are similar to locality sensitive hashes and KD-trees, but they rely on something that’s more like a random forest algorithm. They’re stochastic (random) approaches to splitting your vector space into smaller and smaller chunks of space.

The state of the art for finding the closest matches for high-dimensional vectors is Facebook’s FAISS package and Spotify’s Annoy package. Because Annoy is so easy to install and use, that’s what we chose to use for your chatbot. In addition to it being the workhorse for finding matches among vectors representing song metadata for music fans, Dark Horse Comics has also used Annoy to suggest comic books efficiently. We mentioned these tools in chapter 13.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset