Back to main.

Calmcode Shorts

flashtext.py logoflashtext.py

Let's say that we want to detect a set of names in a text. We could use machine learning for that, maybe do a little entity recognition, but you might also just try and compare to a namelist. So let's write the code that might do that.

We'll use a subset of the English names dataset originally found on Kaggle. You can download the namelist that this video uses here.

from pathlib import Path

names = Path("calmcode/static/data/names.txt").read_text().split("\n")

# Lower case to help make things easy
sentence = "Vincent Willem van Gogh was a Dutch Post-Impressionist painter.".lower()

def contains_name(sentence, names):
    result = []
    for name in names:
        if name in sentence:
            result.append(name)
    return result

This code contains a function called contains_name that can tell us which names have been found in the sentence. It works, but there are two downsides.

  1. While the function tells us that the name appears in the text, it doesn't tell us where it appears in the text.
  2. The function will become slower and slower the more names we have. It only takes ~30 µs when we have 1000 names, but it'd be nice if we didn't incur a computational burden if we add more names to our list.

Luckily, there's a library called flashtext that solves both issues for us.

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)

for name in names:
    keyword_processor.add_keyword(name)

keyword_processor.extract_keywords(sentence, span_info=True)
# [('vincent', 0, 7)]

This is much faster. About 4 µs vs the 30 µs that we had before and it also gives us extra information that we might use.

Flashtext is a nice library when it comes to finding substrings in texts. It respects whitespaces and it's got very few dependencies to boot!

How?

You might wonder how this library is so much faster and this is related to a datastructure trick. When we add keywords, an index is built under the hood. When we call .extract_keywords() this datastructure is re-used.

Related

There's also a ahocorasick_rs Python library with a similar algorithm impemted in rust. The implementation is very fast, but it doesn't respect whitespaces. That means that you may detect mark in remarkable. If your usecase allows for it this library may be worth your while, but flashtext may be the easy alternative.


Back to main.