Let's say that we want to detect a set of names in a text. We could use machine learning for that, maybe do a little entity recognition, but you might also just try and compare to a namelist. So let's write the code that might do that.
from pathlib import Path
names = Path("calmcode/static/data/names.txt").read_text().split("\n")
# Lower case to help make things easy
sentence = "Vincent Willem van Gogh was a Dutch Post-Impressionist painter.".lower()
def contains_name(sentence, names):
result = 
for name in names:
if name in sentence:
This code contains a function called
contains_name that can tell us which names
have been found in the sentence. It works, but there are two downsides.
- While the function tells us that the name appears in the text, it doesn't tell us where it appears in the text.
- The function will become slower and slower the more names we have. It only takes ~30 µs when we have 1000 names, but it'd be nice if we didn't incur a computational burden if we add more names to our list.
Luckily, there's a library called
flashtext that solves both issues for us.
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
for name in names:
# [('vincent', 0, 7)]
This is much faster. About 4 µs vs the 30 µs that we had before and it also gives us extra information that we might use.
Flashtext is a nice library when it comes to finding substrings in texts. It respects whitespaces and it's got very few dependencies to boot!
You might wonder how this library is so much faster and this is related to a datastructure
trick. When we add keywords, an index is built under the hood. When we call
this datastructure is re-used.
There's also a ahocorasick_rs Python library with a similar algorithm impemted in rust.
The implementation is very fast, but it doesn't respect whitespaces. That means that you may detect
mark in remarkable. If your usecase
allows for it this library may be worth your while, but
flashtext may be the easy alternative.
Back to main.