
Calmcode Shorts

When you use the search functionality on a MkDocs Material site, you're using lunr.js in the background. It's a lightweight search engine that handles smaller datasets pretty well. The lunr Python package provides a backend solution, allowing you to parse your documents in Python ahead of time and create a search index that you can query locally or pass on to lunr.js.


Let's grab a dataset to demo lunr.

import pandas as pd

df = pd.read_csv("").assign(idx=lambda d: d.index)
# text                                                             label             idx
# what is my intetest rate                                         interest_rate     16717
# go through all the reminders on my list and state what they are  reminder          9521
# what ingredients go in a milky way                               ingredients_list  14794

Building an Index

We'll use this text column as a string we'd like to query against. Lunr assumes a list of dictionaries as a data structure, so we'll convert our dataframe first.

documents = df.to_dict(orient="records")
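To see what this conversion produces, here's a small sketch on a made-up frame (the rows below are hypothetical, not from the dataset): `to_dict(orient="records")` turns each row into a plain dictionary keyed by column name.

```python
import pandas as pd

# A tiny stand-in frame, mirroring the text/label/idx columns above.
toy = pd.DataFrame({
    "text": ["what is dog in spanish", "set a timer"],
    "label": ["translate", "timer"],
}).assign(idx=lambda d: d.index)

# One dict per row, keyed by column name.
records = toy.to_dict(orient="records")
```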

Next, we can create an index.

from lunr import lunr

index = lunr(ref='idx', fields=('text',), documents=documents)

The lunr function has three parameters.

  • ref is the key in the documents to be used as the reference.
  • fields is a sequence of keys to index from the documents.
  • documents is the list of dictionaries that resemble the documents to be indexed.

This gives us an index variable that we can use to query our data.

index.search('spanish')
# [{'ref': '4501', 'score': 7.801, 'match_data': <MatchData "spanish">},
#  {'ref': '3', 'score': 7.62, 'match_data': <MatchData "spanish">},
#  {'ref': '26', 'score': 7.62, 'match_data': <MatchData "spanish">},
#  ...
#  {'ref': '19726', 'score': 5.065, 'match_data': <MatchData "spanish">}]

This index gives us a score as well as a reference to our original data. We can re-use this to get our original documents again.

[documents[int(i['ref'])] for i in index.search('spanish')]
# [{'text': "can you tell me how to say 'i do not speak much spanish', in spanish",
#  'label': 'translate',
#  'idx': 4501},
# {'text': 'how do you say fast in spanish', 'label': 'translate', 'idx': 3},
# {'text': 'what is dog in spanish', 'label': 'translate', 'idx': 26},
# ...
# {'text': 'please change your language setting to spanish now',
#  'label': 'change_language',
#  'idx': 19726}]

Saving and Loading

Once computed, you can also store the search index on disk to reload later.

import json
from lunr.index import Index

serialized = index.serialize()

# Save the index
with open('idx.json', 'w') as fd:
    json.dump(serialized, fd)

# Load it again
with open("idx.json") as fd:
    reloaded = json.load(fd)

idx = Index.load(reloaded)
idx.search("plant")


Timings

We were curious about performance, so we ran some comparisons.

%timeit df.loc[lambda d: d['text'].str.contains("spanish")]
# 4.79 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit [d for d in documents if 'spanish' in d['text']]
# 1.86 ms ± 32.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit index.search('spanish')
# 304 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit [documents[int(i['ref'])] for i in index.search('spanish')]
# 309 µs ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Lunr retrieves data faster than either a list comprehension or a Pandas query, even when we re-use the index to fetch the original documents.


Lunr is great for smaller datasets that fit on a single machine and for rapid prototyping. It doesn't have great support for typos and it won't scale once your dataset grows bigger.

There are certainly more features to explore in this tool; if you're curious, please check the documentation.
