Rewriting the code with NumPy

NumPy is a library for fast numeric computation and the foundation of Python's scientific ecosystem; both SciPy and pandas are built on top of it. Since we have slow numeric code, NumPy is a great place to start our optimization attempts.

The algorithm is mostly written in NumPy already; we couldn't perform a true closest-N search in pandas, since it doesn't support multidimensional indexing. However, there is one piece of low-hanging fruit: our naive model uses argsort to pick the N closest records, which sorts every row of the distance matrix in full. We don't need that sorting: the N closest elements don't have to be in order, let alone all the rest. Here, we can swap the np.argsort method for np.argpartition. This function does exactly what we want: it puts the N smallest distances first (in no particular order) and keeps all the rest to the right:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def _closest_N2(X1, X2, N=1):
    matrix = euclidean_distances(X1, X2)
    return np.argpartition(matrix, kth=N, axis=1)[:, :N]
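
To see the difference on a toy example (the values here are purely illustrative), compare the two functions on a single row of distances. Note that argpartition gives no ordering guarantee within the first N positions:

import numpy as np

distances = np.array([[0.9, 0.1, 0.5, 0.3, 0.7]])

# argsort orders the entire row: [[1 3 2 4 0]]
print(np.argsort(distances, axis=1))

# argpartition only guarantees that the two smallest indices land in the
# first two positions (in arbitrary order), which is all we need
print(np.argpartition(distances, kth=2, axis=1)[:, :2])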

To ensure that the functions are interchangeable, let's write a simple test function:

def _test_closest(f):
    x1 = pd.DataFrame({'a': [1, 2], 'b': [20, 10]})
    x2 = pd.DataFrame({'a': [2, 1, 0], 'b': [10, 20, 25]})

    answer = np.array([[1, 0, 0]]).T
    assert np.all(f(x2, x1, N=1) == answer)

_test_closest(_closest_N2)

Feel free to add more test cases (this is where a pytest suite comes in handy)!
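
For example, a parametrized pytest suite can run the same assertions against both implementations at once. Here is a minimal sketch; _closest_N is an illustrative stand-in for the original argsort-based function (the name is hypothetical), and the imports assume scikit-learn is available:

import numpy as np
import pandas as pd
import pytest
from sklearn.metrics.pairwise import euclidean_distances


def _closest_N(X1, X2, N=1):
    # naive baseline: full argsort over every row (name is illustrative)
    matrix = euclidean_distances(X1, X2)
    return np.argsort(matrix, axis=1)[:, :N]


def _closest_N2(X1, X2, N=1):
    # optimized version: partial partition instead of a full sort
    matrix = euclidean_distances(X1, X2)
    return np.argpartition(matrix, kth=N, axis=1)[:, :N]


@pytest.mark.parametrize('f', [_closest_N, _closest_N2])
def test_closest(f):
    x1 = pd.DataFrame({'a': [1, 2], 'b': [20, 10]})
    x2 = pd.DataFrame({'a': [2, 1, 0], 'b': [10, 20, 25]})

    answer = np.array([[1, 0, 0]]).T
    assert np.all(f(x2, x1, N=1) == answer)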

Now, we can create a new version of our KNN model using this new function:

class numpyNearestNeighbour(NearestNeighbor):

    def predict(self, X):
        closest = _closest_N2(X, self.X, N=self.N)
        # self.y is already an ndarray here, since we fit on ytrain.values
        return np.mean(np.take(self.y, closest), axis=1)

Note that we also got rid of pd.Series: predict now returns a plain NumPy array. This speeds up the algorithm, but anyone who needs a series will have to wrap the values themselves, outside of the method. Let's leave that decision to our customers.
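
If a caller does want a pandas object back, re-wrapping is a one-liner. A minimal sketch, assuming model is a fitted numpyNearestNeighbour and Xtest is the original feature dataframe (both names are illustrative):

preds = model.predict(Xtest.values)          # now a plain ndarray
preds = pd.Series(preds, index=Xtest.index)  # re-wrap only if pandas is needed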

Now, let's see how that version performs on the same dataset:

>>> numpyKNN = numpyNearestNeighbour(N=5)
>>> numpyKNN.fit(Xtrain.values, ytrain.values)

>>> %%timeit
>>> _ = numpyKNN.predict(Xtv)

448 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We went from 1.43 seconds to 448 ms, a 69% reduction in runtime (roughly a 3.2x speedup)! Let's look at the distribution of time by line:

>>> %lprun -f _closest_N2 numpyKNN.predict(Xtv)

Timer unit: 1e-06 s

Total time: 0.440021 s
File: <ipython-input-134-29fa1851d880>
Function: _closest_N2 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def _closest_N2(X1, X2, N=1):
     2         1     212103.0 212103.0     48.2      matrix = euclidean_distances(X1, X2)
     3         1     227918.0 227918.0     51.8      return np.argpartition(matrix, kth=N, axis=1)[:, :N]

This time, it seems that computing the distance matrix and partitioning it take approximately the same time (this will change for larger datasets, though). To summarize, vectorizing the code with NumPy allowed us to speed up our computations by 69%, all while making our code cleaner and more expressive. For most tasks, NumPy remains the first solution to try out, and often, the result is good enough already.

NumPy is essentially the foundation and industry standard for numeric computation in Python. Many libraries are built on top of NumPy or interoperate with it. In fact, modern NumPy does a great deal of work defining the interface, allowing other libraries to plug in the actual computations and be interchangeable. One example of that is CuPy, a GPU-based alternative to NumPy with a near-identical interface.
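
As an illustration of that drop-in idea, here is a minimal sketch, assuming CuPy is installed and a CUDA-capable GPU is available (the data is random and purely illustrative):

import cupy as cp  # near drop-in replacement for `import numpy as np`

matrix = cp.random.rand(1_000, 100)                      # array lives on the GPU
closest = cp.argpartition(matrix, kth=5, axis=1)[:, :5]  # same call as in NumPy
closest = cp.asnumpy(closest)                            # copy the result back to the host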

If you want to dive deeper into NumPy-based computations, take a look at these resources:
