
Calmcode Shorts

pytesseract.py

If you want to extract text from an image, you may enjoy using tesseract. It's an old-school tool that works quite well and it's also very lightweight. You'll need to install the tool on your system before you can access it from Python.

# For Ubuntu
apt install tesseract-ocr
# For Mac
brew install tesseract

Once that is installed, you can install the Python bindings in your virtual environment.

python -m pip install pytesseract
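
Because pytesseract only wraps the command-line tool, it can help to confirm that the binary is reachable before going further. Here's a minimal sketch that uses only the standard library. The helper name `tesseract_available` is made up for this example.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on the PATH."""
    # shutil.which returns the full path to the executable, or None.
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("tesseract found at", shutil.which("tesseract"))
    else:
        print("tesseract is not on the PATH; install it first")
```

If the binary lives somewhere unusual, pytesseract lets you point at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.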

OCR with Pytesseract

Once that's all installed, you can use pytesseract from inside your Jupyter notebook!

# Pillow provides the PIL package.
from PIL import Image

import pytesseract

img = Image.open('path/to/img.png')
print(pytesseract.image_to_string(img))

You can even get a bounding box per character if you'd like.

print(pytesseract.image_to_boxes(img))
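
What comes back from `image_to_boxes` is a plain string with one line per character. As far as the tesseract box format goes, each line holds the character followed by left, bottom, right, top and a page number, with coordinates measured from the bottom-left corner of the image. Here's a small sketch that turns such a string into tuples. The names `parse_boxes` and `CharBox` are hypothetical, and the sample string is made up to stand in for real output.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_boxes(raw: str) -> List[CharBox]:
    """Parse box-format text: one 'char left bottom right top page' line per character."""
    boxes = []
    for line in raw.splitlines():
        # Split from the right so the character itself stays intact.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, *numbers = parts
        boxes.append(CharBox(char, *map(int, numbers)))
    return boxes


# Made-up sample in the same shape as pytesseract.image_to_boxes(img) output.
sample = "H 10 20 18 34 0\ni 20 20 23 34 0\n"
for box in parse_boxes(sample):
    # Print each character with its width and height in pixels.
    print(box.char, box.right - box.left, box.top - box.bottom)
```

In a notebook you'd pass `pytesseract.image_to_boxes(img)` instead of the sample string.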

Small Warning

Although tesseract works quite well in many applications, it's not a perfect solution. It mainly works on images with a white background and text that's generated by a printer or a computer. If you're using handwritten notes that are scattered all over a page, your mileage may vary.
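
One way to notice when results are getting shaky is to look at per-word confidence scores. pytesseract has an `image_to_data` function that can return a dictionary (via `output_type=pytesseract.Output.DICT`) with parallel lists under keys such as `text` and `conf`. The sketch below filters such a dictionary. The helper name `confident_words`, the threshold of 60 and the sample data are assumptions for illustration, not a recommendation from tesseract itself.

```python
from typing import Dict, List


def confident_words(data: Dict[str, list], min_conf: float = 60.0) -> List[str]:
    """Keep the non-empty words whose confidence is at least min_conf."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # conf may arrive as a string, int or float depending on the version;
        # -1 marks entries that are not words (blocks, lines, ...).
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


# Made-up sample shaped like the output of
# pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
sample = {"text": ["", "Hello", "w0rld"], "conf": ["-1", 96, 31.5]}
print(confident_words(sample))
```

If many words fall below the threshold, that's a hint the image may be outside of what tesseract handles well.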

