F.1. High-dimensional vectors are different

As you add dimensions to vectors going from 1D to 2D and even 3D, nothing much changes about the kinds of math you can do to find nearest neighbors quickly. Let’s talk a little bit about conventional approaches used by database indexes for 2D vectors and work our way up to high-dimensional vectors. That will help you see where the math breaks down (becomes impractical).

F.1.1. Vector space indexes and hashes

Indexes are designed to make looking up something easy. Real-valued (float) indexes for things like latitude and longitude can’t rely on exact matches, the way the index of words at the back of a textbook does. For 2D real-valued data, most indexes use some sort of bounding box to divide a low-dimensional space into manageable chunks. The most common example of an index like this for two-dimensional geographic location information is the postal code system used in various countries to group mail delivery addresses by contiguous regions (called ZIP Codes in the US).

You assign a number to each region in 2D space; even though postal code regions aren’t rectangular bounding boxes, they serve the same purpose. The military’s grid system does the same thing more literally, dividing up the globe into rectangular bounding boxes and assigning each one a number. In both US ZIP Codes and the military grid system, the numbers for these regions have semantic meaning.

The “locality sensitivity” of hashes like US ZIP Codes comes from the fact that codes that are close to each other in ordinal value are close to each other in the 2D space they map to. For example, the first digit of a US ZIP Code identifies a broad region, such as the West Coast or the Southwest of the US. The next digit (combined with the first) identifies a particular state. The first three digits identify a region or large city within that state. The locality sensitivity of US ZIP Codes continues all the way down to the “+4” suffix, which identifies a particular city block or even apartment building.[2]

[2] The ZIP Code Wikipedia article contains a map that shows this locality sensitivity (https://en.wikipedia.org/wiki/ZIP_Code#Primary_state_prefixes).
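
If you’d like to see the idea in code, here’s a toy sketch (our own illustration, not how ZIP Codes are actually assigned) of a ZIP-Code-like hash for 2D coordinates. Each additional digit halves the latitude and longitude ranges, so nearby points share a long prefix:

>>> def grid_hash(lat, lon, num_digits=4):
...     """Toy ZIP-Code-like hash: each extra digit pins down a smaller 2D cell."""
...     code = ''
...     lat_lo, lat_hi, lon_lo, lon_hi = -90., 90., -180., 180.
...     for _ in range(num_digits):
...         lat_mid = (lat_lo + lat_hi) / 2.
...         lon_mid = (lon_lo + lon_hi) / 2.
...         code += str(2 * int(lat >= lat_mid) + int(lon >= lon_mid))
...         lat_lo, lat_hi = (lat_mid, lat_hi) if lat >= lat_mid else (lat_lo, lat_mid)
...         lon_lo, lon_hi = (lon_mid, lon_hi) if lon >= lon_mid else (lon_lo, lon_mid)
...     return code
>>> grid_hash(45.5, -122.7)   # Portland, OR
'2210'
>>> grid_hash(45.6, -122.6)   # a nearby point lands in the same cell
'2210'
>>> grid_hash(40.7, -74.0)    # New York City shares only the first digit
'2122'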

The manual process that produced the US ZIP Code system did by hand what locality sensitive hashing algorithms do for other vector spaces. A locality sensitive hashing algorithm defines a way to produce these locality sensitive numbers: it uses the coordinates of locations in a vector space so that the numerical values of the hashes are close to each other whenever the regions they map to in the vector space are close to each other, or even overlap. In other words, locality sensitive hashes deliberately cultivate the mathematical properties, like a high probability of collision and locality sensitivity, that cryptographic hashing algorithms try to avoid.
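
For continuous vector spaces, one common family of locality sensitive hashes uses random hyperplanes: each bit of the hash records which side of a random hyperplane a vector falls on, so nearby vectors tend to agree on most bits. Here’s a minimal sketch (the eight-hyperplane setup and the lsh name are our own illustrative choices, and your bit strings will differ from run to run):

>>> import numpy as np
>>> planes = np.random.randn(8, 2)   # 8 random hyperplanes through the origin in 2D
>>> def lsh(vector):
...     """One hash bit per hyperplane: which side of the plane is the vector on?"""
...     return ''.join(str(int(bit)) for bit in (planes.dot(vector) > 0))
>>> lsh([1.0, 1.1])    # two nearby vectors usually share most (often all) of their bits
>>> lsh([1.1, 1.0])
>>> lsh([-1.0, -1.0])  # a vector pointing the other way flips most of its bits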

F.1.2. High-dimensional thinking

Natural language vector spaces are high dimensional. Natural language captures all the complex concepts that humans think and talk about, including natural language processing itself. So when you want to squeeze all that complexity into a vector, you often discard some of that complexity so it’ll fit in your rigid vector-space box. And you keep adding dimensions to your vector to accommodate as much of the complexity of human thought and language as you can.

For example, bag-of-words vectors discard the information content contained in the order of words. This allows you to produce discrete high-dimensional vectors that you can index and search efficiently. You use binary search and indexing trees to detect whether particular keywords are present or absent in both your query and your corpus. This works even if you expand your keyword vocabulary to include all the words in a natural language. Web search engines often include all the words of hundreds of natural languages at once. That’s why your web search can include Spanish and German words alongside English words, all in the same query.
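
One concrete way to see how discrete keyword vectors stay searchable is an inverted index that maps each token to the set of documents containing it. Here’s a small sketch with a made-up three-document corpus (a simpler cousin of the tree-based indexes mentioned above, not code from the nlpia package):

>>> from collections import defaultdict
>>> corpus = {0: 'the quick brown fox', 1: 'the lazy dog', 2: 'brown dog'}
>>> index = defaultdict(set)
>>> for doc_id, text in corpus.items():
...     for token in text.split():
...         index[token].add(doc_id)
>>> index['brown'] & index['dog']    # documents that contain both keywords
{2}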

You learned in chapter 2 how to capture a bit of the complexity of the order of words by adding N-grams to your bag-of-words vector dimensions. And you learned in chapter 3 how to weight those millions of terms (words and N-grams) according to how important they are. This leaves you with millions of dimensions or “bins” in your vector space model of human languages.

But bag-of-words, TF-IDF, and regular expressions can’t understand you. They can only help you find the documents you’re looking for. So in chapter 4 through chapter 6, you allowed your vector spaces to become continuous. You squeezed some of the complexity of natural language into the gaps in the number line between the integer counts of words. You no longer relied on a rigid, discrete vocabulary to define the dimensions of your vector space. By grouping words into concepts or topics you reduced the dimensions of your vectors from millions down to hundreds. You created nessvectors that captured the femaleness and blueness and placeness of words and statements.

And there’s more. In chapters 7 through 10, you figured out how to capture not only word combinations, but long, complex word sequences in your vectors. Your shallow nessvectors became deep thought vectors when you learned about recurrent neural networks and long short-term memory.

But all this depth and complexity creates a problem. Continuous, dense, high-dimensional vectors like thought vectors cannot be indexed or searched efficiently. That’s why search engines don’t just answer your complex question in a millisecond. When you want to know the meaning of life, you have to have a conversation with a chatbot, or, perish the thought, another human being.

Why is that? Why can’t you index or search a high-dimensional continuous vector? Start with a 1D vector “space” and see how easy it is to index and search a single scalar value on a 1D number line. Then you can think about how to extend that 1D index to handle multiple dimensions. And you can keep going up in dimensionality from there until things break down.

A 1D index

Imagine a random distribution of 1D vectors—a bunch of random numbers. You can create a natural 1D bounding box that’s half the width of the overall space by cutting the number line in half. You could put all the positive values in one box and the negative values in another box. As long as you have a pretty good idea where the middle or centroid of your vector space is located (usually zero), each box will have about half the number of vectors in it.

Each of those bounding boxes could be split in half again to create a total of four boxes. If you kept that up, a few more of those divisions would create a binary search tree, or a binary hash that’s sensitive to locality (where each vector is located). After num_levels of splitting there are pow(2, num_levels) boxes, so for a 1D vector space the average number of points in each box is num_vectors / pow(2, num_levels). For 1D space, you need only about 32 levels (box sizes) for your boxes to index billions of points so that there are only a few in each box.

Each of your 1D vectors can have its own ZIP Code, an index value or locality sensitive hash. And all the vectors that are similar to each other will be nearby in a sorted list of all those hash values. That way you can compute the hash values for some new query and find it quickly in your database.
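
Here’s a minimal sketch of that 1D scheme (the hash_1d name and the [-4, 4] range are our own illustrative choices). The hash is just the sequence of left/right decisions as you repeatedly halve the number line, so values that are close together usually share a long prefix, apart from the occasional point that lands right next to a split boundary:

>>> import numpy as np
>>> vecs = np.random.randn(1000)          # 1,000 random 1D "vectors"
>>> def hash_1d(x, low=-4., high=4., num_bits=8):
...     """Record which half of the interval x lands in, num_bits times over."""
...     bits = ''
...     for _ in range(num_bits):
...         mid = (low + high) / 2.
...         bits += '1' if x >= mid else '0'
...         low, high = (mid, high) if x >= mid else (low, mid)
...     return bits
>>> hashes = sorted((hash_1d(x), x) for x in vecs)  # a sortable "ZIP Code" per vector
>>> prefix = hash_1d(0.5)[:4]              # in a real index you'd bisect the sorted list
>>> neighbors = [x for (h, x) in hashes if h.startswith(prefix)]
>>> min(neighbors), max(neighbors)         # all within the same 0.5-wide bin around 0.5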

2D, 3D, 4-D indexes

Let’s add a dimension and see how well the 1D binary tree index works. Think about how you’d divide the space into regions in a binary tree, cutting your region approximately in half with each fork in the tree. Which dimension would you cut each time you tried to reduce the number of points by half? For a 2D vector this would be the 2 x 2 squares, or quadrants, of the 2D plane. For a 3D vector this might be the 3 x 3 x 3 blocks in a “Rubik’s Cube” of space. For 4-D you’d need about 4 x 4 x 4 x 4 blocks... just to get started. The first fork in your binary tree index would create 4^4 (256) branches. And some of those 256 bounding boxes in your 4-D vector space might not contain any vectors at all. Some word combinations or sequences never occur.
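
Just to put numbers on that growth, here’s a line of plain arithmetic (nothing more) showing how the box count explodes even if you merely cut every dimension in half once:

>>> for num_dims in [2, 3, 4, 10, 100]:
...     print(num_dims, 2 ** num_dims)   # boxes created by halving every dimension once
2 4
3 8
4 16
10 1024
100 1267650600228229401496703205376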

Our naive binary tree approach works OK for 3D and 4-D vectors and even all the way out to 8-D or more. But it quickly gets unruly and inefficient. Imagine what your bounding “cubes” would be like in 10 dimensions. You’re not alone if your brain can’t handle that concept. Human brains live in a 3D world, so they aren’t capable of fully grasping even 4-D vector space concepts.

Machines can handle 10-D OK, but you need them to handle 100-D or more if you want to squeeze the complexity of human thought into vectors. You can think of this curse of dimensionality in a few different ways:

  • The number of possible combinations of dimensions grows exponentially with each added dimension.
  • All vectors are far away from each other in high-dimensional spaces.
  • High-dimensional vector spaces are mostly empty space—a random bounding box is almost always empty.

The following code may help you get a feel for these properties of high-dimensional spaces.

Listing F.1. Explore high-dimensional space
>>> import pandas as pd
>>> import numpy as np
>>> from tqdm import tqdm
 
>>> num_vecs = 100000
>>> num_radii = 20
>>> num_dim_list = [2, 4, 8, 18, 32, 64, 128]
>>> radii = np.array(list(range(1, num_radii + 1)))
>>> radii = radii / len(radii)                    # box half-widths from 0.05 up to 1.0
>>> counts = np.zeros((len(radii), len(num_dim_list)))
>>> rand = np.random.rand
>>> for j, num_dims in enumerate(tqdm(num_dim_list)):
...     x = rand(num_vecs, num_dims)
...     denom = (1. / np.linalg.norm(x, axis=1))  # normalize each random row vector ...
...     x *= denom.reshape(-1, 1).dot(np.ones((1, x.shape[1])))  # ... to unit length
...     for i, r in enumerate(radii):
...         mask = (-r < x) & (x < r)             # is every component within +/- r?
...         counts[i, j] = (mask.sum(axis=1) == mask.shape[1]).sum()

You can explore this weird world of high-dimensional spaces in nlpia/book/examples/ch_app_h.py on GitHub (http://github.com/totalgood/nlpia). You can see much of the weirdness in the following table, which shows the fraction of the unit-length vectors that fit entirely inside a bounding box as you expand its size bit by bit:

>>> df = pd.DataFrame(counts, index=radii, columns=num_dim_list) / num_vecs
>>> df = df.round(2)
>>> df[df == 0] = ''
>>> df
       2     4     8     18    32   64    128
0.05
0.10
0.15                                     0.37
0.20                                0.1     1
0.25                                  1     1
0.30                          0.55    1     1
0.35                    0.12  0.98    1     1
0.40                    0.62     1    1     1
0.45              0.03  0.92     1    1     1
0.50               0.2  0.99     1    1     1
0.55        0.01   0.5     1     1    1     1
0.60        0.08  0.75     1     1    1     1
0.65        0.24  0.89     1     1    1     1
0.70        0.45  0.96     1     1    1     1
0.75  0.12  0.64  0.99     1     1    1     1
0.80  0.25  0.78     1     1     1    1     1
0.85  0.38  0.88     1     1     1    1     1
0.90  0.51  0.94     1     1     1    1     1
0.95  0.67  0.98     1     1     1    1     1
1.00     1     1     1     1     1    1     1

There’s an indexing algorithm called a KD-Tree (https://en.wikipedia.org/wiki/K-d_tree) that attempts to divide up high-dimensional spaces as efficiently as possible, minimizing the number of empty bounding boxes. But even these approaches break down at dozens or hundreds of dimensions as the curse of dimensionality kicks in. Unlike with 2D and 3D vectors, it’s not possible to truly index or hash high-dimensional word and thought vectors in a way that lets you retrieve the closest matches quickly. You just have to calculate the distance to a lot of guesses at the nearest neighbors until you find a few that are close. Or you have to check them all, if you want to be sure you didn’t miss any.
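
To see the trade-off concretely, here’s a small sketch using scipy.spatial.KDTree (assuming you have SciPy installed; this isn’t code from the nlpia package). In three dimensions the tree prunes most of the bounding boxes and still returns exactly the same neighbors as brute force; in hundreds of dimensions the pruning stops paying off and you’re effectively back to checking every vector:

>>> import numpy as np
>>> from scipy.spatial import KDTree
>>> vecs = np.random.randn(10000, 3)       # 10,000 random 3D vectors
>>> tree = KDTree(vecs)                    # build the index once
>>> query = np.random.randn(3)
>>> distances, indices = tree.query(query, k=5)      # 5 nearest neighbors via the tree
>>> brute = np.linalg.norm(vecs - query, axis=1).argsort()[:5]   # check them all instead
>>> set(indices) == set(brute)             # same answers; the tree just prunes boxes
True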
