Rewriting the code with NumPy

NumPy is a library for fast numeric computation and the foundation of Python's scientific ecosystem; both SciPy and pandas are built on top of it. Since we have slow numeric code, NumPy is a great place to start our optimization attempts.

The algorithm is mostly written in NumPy already; we couldn't perform a true closest-N search in pandas, since it doesn't support multidimensional indexing. However, there is one piece of low-hanging fruit: our naive model uses argsort to pick the N closest records, which sorts every row of the distance matrix in full. We don't need that sorting: the N closest elements don't have to be in order, let alone all the rest. Here, we can swap the np.argsort method for np.argpartition. This function does exactly what we want: it puts the N smallest distances first (in no particular order) and keeps all the rest to the right:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def _closest_N2(X1, X2, N=1):
    matrix = euclidean_distances(X1, X2)
    return np.argpartition(matrix, kth=N, axis=1)[:, :N]
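
To see the difference on a toy example (the values here are purely illustrative), compare the two functions on a single row of distances. Note that argpartition gives no ordering guarantee within the first N positions:

import numpy as np

distances = np.array([[0.9, 0.1, 0.5, 0.3, 0.7]])

# argsort orders the entire row: [[1 3 2 4 0]]
print(np.argsort(distances, axis=1))

# argpartition only guarantees that the two smallest indices land in the
# first two positions (in arbitrary order), which is all we need
print(np.argpartition(distances, kth=2, axis=1)[:, :2])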

To ensure that the functions are interchangeable, let's write a simple test function:

def _test_closest(f):
    x1 = pd.DataFrame({'a': [1, 2], 'b': [20, 10]})
    x2 = pd.DataFrame({'a': [2, 1, 0], 'b': [10, 20, 25]})

    answer = np.array([[1, 0, 0]]).T
    assert np.all(f(x2, x1, N=1) == answer)

_test_closest(_closest_N2)

Feel free to add more test cases (this is where a pytest suite comes in handy)!
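
For example, a parametrized pytest suite can run the same assertions against both implementations at once. Here is a minimal sketch; _closest_N is an illustrative stand-in for the original argsort-based function (the name is hypothetical), and the imports assume scikit-learn is available:

import numpy as np
import pandas as pd
import pytest
from sklearn.metrics.pairwise import euclidean_distances


def _closest_N(X1, X2, N=1):
    # naive baseline: full argsort over every row (name is illustrative)
    matrix = euclidean_distances(X1, X2)
    return np.argsort(matrix, axis=1)[:, :N]


def _closest_N2(X1, X2, N=1):
    # optimized version: partial partition instead of a full sort
    matrix = euclidean_distances(X1, X2)
    return np.argpartition(matrix, kth=N, axis=1)[:, :N]


@pytest.mark.parametrize('f', [_closest_N, _closest_N2])
def test_closest(f):
    x1 = pd.DataFrame({'a': [1, 2], 'b': [20, 10]})
    x2 = pd.DataFrame({'a': [2, 1, 0], 'b': [10, 20, 25]})

    answer = np.array([[1, 0, 0]]).T
    assert np.all(f(x2, x1, N=1) == answer)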

Now, we can create a new version of our KNN model using this new function:

class numpyNearestNeighbour(NearestNeighbor):

    def predict(self, X):
        closest = _closest_N2(X, self.X, N=self.N)
        # self.y is already an ndarray here, since we fit on ytrain.values
        return np.mean(np.take(self.y, closest), axis=1)

Note that we also got rid of pd.Series: predict now returns a plain NumPy array. This speeds up the algorithm, but anyone who needs a series will have to wrap the values themselves, outside of the method. Let's leave that decision to our customers.
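
If a caller does want a pandas object back, re-wrapping is a one-liner. A minimal sketch, assuming model is a fitted numpyNearestNeighbour and Xtest is the original feature dataframe (both names are illustrative):

preds = model.predict(Xtest.values)          # now a plain ndarray
preds = pd.Series(preds, index=Xtest.index)  # re-wrap only if pandas is needed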

Now, let's see how that version performs on the same dataset:

>>> numpyKNN = numpyNearestNeighbour(N=5)
>>> numpyKNN.fit(Xtrain.values, ytrain.values)

>>> %%timeit
>>> _ = numpyKNN.predict(Xtv)

448 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We went from 1.43 seconds to 448 ms, a 69% reduction in runtime (roughly a 3.2x speedup)! Let's look at the distribution of time by line:

>>> %lprun -f _closest_N2 numpyKNN.predict(Xtv)

Timer unit: 1e-06 s

Total time: 0.440021 s
File: <ipython-input-134-29fa1851d880>
Function: _closest_N2 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def _closest_N2(X1, X2, N=1):
     2         1     212103.0 212103.0     48.2      matrix = euclidean_distances(X1, X2)
     3         1     227918.0 227918.0     51.8      return np.argpartition(matrix, kth=N, axis=1)[:, :N]

This time, it seems that computing the distance matrix and partitioning it take approximately the same time (this will change for larger datasets, though). To summarize, vectorizing the code with NumPy allowed us to speed up our computations by 69%, all while making our code cleaner and more expressive. For most tasks, NumPy remains the first solution to try out, and often, the result is good enough already.

NumPy is essentially the foundation and industry standard for numeric computation in Python. Many libraries are built on top of NumPy or interoperate with it. In fact, modern NumPy does a great deal of work defining the interface, allowing other libraries to plug in the actual computations and be interchangeable. One example of that is CuPy, a GPU-based alternative to NumPy with a near-identical interface.
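
As an illustration of that drop-in idea, here is a minimal sketch, assuming CuPy is installed and a CUDA-capable GPU is available (the data is random and purely illustrative):

import cupy as cp  # near drop-in replacement for `import numpy as np`

matrix = cp.random.rand(1_000, 100)                      # array lives on the GPU
closest = cp.argpartition(matrix, kth=5, axis=1)[:, :5]  # same call as in NumPy
closest = cp.asnumpy(closest)                            # copy the result back to the host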

If you want to dive deeper into NumPy-based computations, take a look at these resources:
