When you use the search box on a MkDocs Material site, you're using lunr.js in the background. It's a lightweight search engine that handles smaller datasets well. Lunr.py provides a Python backend, allowing you to parse the documents ahead of time and create a lunr.js-compatible index that you can query locally or pass on to lunr.js.
Dataset
Let's grab a dataset to demo lunr.
import pandas as pd
df = pd.read_csv("https://calmcode.io/datasets/clinc.csv").assign(idx=lambda d: d.index)
df.sample(3)
text | label | idx |
---|---|---|
what is my intetest rate | interest_rate | 16717 |
go through all the reminders on my list and state what they are | reminder | 9521 |
what ingredients go in a milky way | ingredients_list | 14794 |
Building an Index
We'll use this `text` column as the string we'd like to query against. Lunr assumes a list of dictionaries as its data structure, so we'll convert our dataframe first.
documents = df.to_dict(orient="records")
Next, we can create an index.
from lunr import lunr
index = lunr(ref='idx', fields=('text',), documents=documents)
The `lunr` function has three parameters:

- `ref` is the key in the documents to be used as the reference.
- `fields` is a sequence of keys to index from the documents.
- `documents` is the list of dictionaries that represent the documents to be indexed.
This gives us an `index` variable that we can use to query our data.
index.search('spanish')
# [{'ref': '4501', 'score': 7.801, 'match_data': <MatchData "spanish">},
# {'ref': '3', 'score': 7.62, 'match_data': <MatchData "spanish">},
# {'ref': '26', 'score': 7.62, 'match_data': <MatchData "spanish">},
# ...
# {'ref': '19726', 'score': 5.065, 'match_data': <MatchData "spanish">}]
This index gives us a score as well as a reference to our original data. We can re-use this to get our original documents again.
[documents[int(i['ref'])] for i in index.search('spanish')]
# [{'text': "can you tell me how to say 'i do not speak much spanish', in spanish",
# 'label': 'translate',
# 'idx': 4501},
# {'text': 'how do you say fast in spanish', 'label': 'translate', 'idx': 3},
# {'text': 'what is dog in spanish', 'label': 'translate', 'idx': 26},
# ...
# {'text': 'please change your language setting to spanish now',
# 'label': 'change_language',
# 'idx': 19726}]
Saving and Loading
Once computed, you can also store the search index on disk to reload later.
import json
from lunr.index import Index
serialized = index.serialize()
# Save the index
with open('idx.json', 'w') as fd:
json.dump(serialized, fd)
# Load it again
with open("idx.json") as fd:
reloaded = json.loads(fd.read())
idx = Index.load(reloaded)
idx.search("plant")
Benchmark
We were curious about the performance statistics so we ran some comparisons.
%timeit df.loc[lambda d: d['text'].str.contains("spanish")]
# 4.79 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit [d for d in documents if 'spanish' in d['text']]
# 1.86 ms ± 32.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit index.search('spanish')
# 304 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit [documents[int(i['ref'])] for i in index.search('spanish')]
# 309 µs ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Lunr is faster at retrieving data than either a list comprehension or a Pandas query, even when we re-use the index to fetch the original documents.
Use Case
Lunr is great for smaller datasets that fit on a single machine and for rapid prototyping. It doesn't have great support for typos and it won't scale once your dataset grows bigger.
There are certainly more features to explore in this tool; if you're curious, check the documentation.