... annoy.

The nearest neighbor problem is very common in data science. It's useful in recommender systems but also with neural embeddings in general. Finding exact nearest neighbors is expensive, so it is common to settle for approximate nearest neighbors instead. In Python a very likeable tool for this is annoy.
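
To make the cost concrete, here is a minimal sketch of the exact search that an approximate index tries to avoid. It uses only the standard library. The helper name `exact_nearest` and the random data are made up for illustration and are not part of annoy.

```python
import math
import random


def exact_nearest(query, points, k=3):
    """Return the indices of the k points closest to `query` (Euclidean).

    Every point is compared against the query, so one lookup costs
    time proportional to the number of stored points.
    """
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return order[:k]


random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]

# Query with a point that is itself in the data set.
neighbors = exact_nearest(points[0], points, k=3)
```

Scanning everything like this is fine for a thousand points, but it has to be repeated for every query, which is what makes an approximate index attractive on larger data sets.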

Episode Notes

The full API is defined here.

If you want to pick up the full notebook of everything done here you can find it over here.


Metrics can be angular, euclidean, manhattan, hamming, or dot.


If you want to save an index to disk and load it again, you can use this code:

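import numpy as np
from annoy import AnnoyIndex

columns = 2
vecs = np.concatenate([
    np.random.normal(-1, 1, (5000, columns)),
    np.random.normal(0, 0.5, (5000, columns)),
])

metric = 'euclidean'

annoy = AnnoyIndex(columns, metric)
for i in range(vecs.shape[0]):
    annoy.add_item(i, vecs[i, :])
# the index has to be built before saving; 10 trees is an arbitrary choice
annoy.build(10)

# here we save the annoy index ('test.ann' is a placeholder filename)
annoy.save('test.ann')

# next we make a new object with the same settings
annoy_from_disk = AnnoyIndex(columns, metric)
# here we load it in again
annoy_from_disk.load('test.ann')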