
Annoy


The nearest neighbor problem is very common in data science. It's useful in recommender systems, but also for neural embeddings in general. Exact nearest neighbors are expensive to compute, so it is common to calculate approximate distances as a proxy instead. In Python a very likeable tool for this is annoy.
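To get a feel for why the exact calculation is expensive, note that a single query has to compare against every vector in the dataset. Below is a minimal brute-force sketch in numpy, using made-up random data as a stand-in.

import numpy as np

# made-up dataset: one million two-dimensional vectors
vecs = np.random.normal(0, 1, (1_000_000, 2))
query = np.array([-2., -2.])

# brute force: compute the distance from the query to every vector
distances = np.linalg.norm(vecs - query, axis=1)
# indices of the ten nearest neighbors
nearest = np.argsort(distances)[:10]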


Notes

Here's the benchmark code for the video.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from annoy import AnnoyIndex

# this is the original query
query = np.array([-2., -2.])
# scikit-learn expects a 2D array of queries, so we wrap the query
q = np.array([query])
# we will retrieve 10 neighbors in each case
n = 10
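
One more piece of setup: the annoy sections below need the dimensionality of the vectors. The vecs variable itself is the dataset from the video; the random array from the sketch above (or any 2D float array) can stand in for it.

# vecs is the dataset from the video; the random stand-in above also works
columns = vecs.shape[1]  # annoy needs the dimensionality of the vectors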

Scikit-Learn Ball Tree

This code builds the object from scikit-learn.

nn = NearestNeighbors(n_neighbors=n, algorithm='ball_tree').fit(vecs)

Here we time the retrieval.

%%timeit 
distances, indices = nn.kneighbors(q)

Scikit-Learn KD-Tree

This code builds the object from scikit-learn.

nn = NearestNeighbors(n_neighbors=n, algorithm='kd_tree').fit(vecs)

Here we time the retrieval.

%%timeit 
distances, indices = nn.kneighbors(q)

Scikit-Learn Brute Force

This code builds the object from scikit-learn.

nn = NearestNeighbors(n_neighbors=n, algorithm='brute').fit(vecs)

Here we time the retrieval.

%%timeit 
distances, indices = nn.kneighbors(q)

Annoy with 10 trees

This code builds the index.

# the index needs the dimensionality of the vectors and a distance metric
annoy = AnnoyIndex(columns, 'euclidean')
# items are added one at a time, each with an integer id
for i in range(vecs.shape[0]):
    annoy.add_item(i, vecs[i, :])
annoy.build(n_trees=10)

Here we time the retrieval.

%%timeit 
annoy.get_nns_by_vector(query, n)
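
Annoy can also hand back the distances together with the indices, and the optional search_k argument trades query speed for accuracy by inspecting more nodes. The value below is an arbitrary choice for illustration.

# also return distances, and inspect more nodes for a more accurate result
indices, distances = annoy.get_nns_by_vector(query, n, search_k=100, include_distances=True)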

Annoy with 1 tree

This code builds the index.

annoy = AnnoyIndex(columns, 'euclidean')
for i in range(vecs.shape[0]):
    annoy.add_item(i, vecs[i, :])
# a single tree queries faster but is less accurate
annoy.build(n_trees=1)

Here we time the retrieval.

%%timeit 
annoy.get_nns_by_vector(query, n)
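
Since annoy is approximate, it is worth checking how many of the true neighbors it actually finds. Here's a small sketch that compares the single-tree result against the exact answer from the brute-force scikit-learn model fitted earlier.

# exact answer from the brute-force model, approximate answer from annoy
_, exact = nn.kneighbors(q)
approx = annoy.get_nns_by_vector(query, n)

# fraction of the true neighbors that the single tree recovered
recall = len(set(exact[0]) & set(approx)) / n
print(recall)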

Feedback? See an issue? Something unclear? Feel free to mention it here.

If you want to be kept up to date, consider getting the newsletter.