How to extract text from a PDF containing images using Python and Tesseract on macOS

Sometimes you need to take written information from the real world (e.g. a letter or another document) and enter it into the computer. Reading a document off-screen and typing it in manually is error-prone, time-consuming and boring. I therefore wrote a simple Python script that takes a PDF, converts its pages into images and extracts the written text using an OCR engine, so that the contents of the document can easily be copied and pasted.

I am assuming you have a recent version of Python 3 installed (as of this writing, the latest is 3.7.3). We are going to need a few tools and libraries for this: Poppler, pdf2image, Tesseract and pytesseract.

pdf2image is a Python library that wraps Poppler, a PDF rendering library. Tesseract is an open-source OCR (optical character recognition) engine, originally developed at Hewlett-Packard and later sponsored by Google (https://opensource.google/projects/tesseract). pytesseract is a Python wrapper around Tesseract.

Tesseract is an OCR engine with support for Unicode and the ability to recognize more than 100 languages out of the box. It can be trained to recognize other languages. (Source: https://opensource.google/projects/tesseract)

Let’s start by installing all the required tools and libraries. For Tesseract and Poppler, I am relying on Homebrew this time (I usually prefer to build from source manually). For the Python libraries, I am going to use pip3, the package installer for Python 3.

brew install tesseract poppler
pip3 install pdf2image pytesseract
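
Before moving on, it can help to confirm that the command-line tools are actually reachable, since both Python libraries only wrap external programs. The following is a minimal sketch, assuming that pytesseract calls the `tesseract` binary and that pdf2image calls Poppler's `pdftoppm` and `pdfinfo`; the helper name `find_missing_tools` is made up for this example.

```python
import shutil

# Command-line tools the Python wrappers are assumed to call under the hood:
# pytesseract runs `tesseract`; pdf2image runs Poppler's `pdftoppm` and `pdfinfo`.
REQUIRED_TOOLS = ["tesseract", "pdftoppm", "pdfinfo"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

If a tool is reported as missing right after installation, opening a new terminal window so that the PATH is reloaded is usually worth trying first.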

Next we are going to write our simple script that will:

Take a PDF with images (e.g. a scanned letter)
Convert the PDF into a series of pages
Iterate over the pages and save them as images to the disk
Open the images and extract the text into a string

from PIL import Image
import pytesseract
import pdf2image

# Render each PDF page as a PIL image at 500 DPI
pages = pdf2image.convert_from_path('letter.pdf', dpi=500)

# enumerate() gives every page a running number for its file name
for counter, page in enumerate(pages):
    file_name = 'page' + str(counter) + '.jpg'

    # Save the page image to the current folder
    page.save(file_name, 'JPEG')

    # Open the saved file as an image
    image_file = Image.open(file_name)

    # Use Tesseract to extract the text from the image
    string_contents = pytesseract.image_to_string(image_file)

    # Print the contents to the console
    print(string_contents)

Processing can take a few seconds per page, so be patient. The extracted text is printed to the console, and the image files are saved in the folder you run the script from.
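
For longer documents, copying from the console gets tedious, so it may be more convenient to collect the text of all pages in one file. Here is a small sketch of that idea; the function name `save_page_texts`, the page header format and the output file name are assumptions made for illustration, not part of the script above.

```python
from pathlib import Path


def save_page_texts(page_texts, output_path):
    """Write the OCR text of all pages to one file, with a header per page."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        # Trim stray whitespace around each page and label it with its number
        sections.append(f"--- Page {number} ---\n{text.strip()}\n")
    Path(output_path).write_text("\n".join(sections), encoding="utf-8")
    return output_path
```

To use it, append each `string_contents` to a list inside the loop instead of printing it, then call something like `save_page_texts(all_texts, 'letter.txt')` once the loop has finished.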

Now here comes the catch. Depending on the quality of the picture (angle, blurriness etc.), the output can vary, and it is quite possible that some characters are recognized incorrectly. You should therefore double-check the output and correct errors where necessary.
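
Part of that clean-up is mechanical and can be scripted. The sketch below only tidies the layout of the text (words hyphenated across line breaks, repeated spaces, piles of blank lines); it does not fix misrecognized characters, which still need a human eye. The function name `tidy_ocr_text` and the specific rules are my own assumptions about what typical OCR output needs.

```python
import re


def tidy_ocr_text(text):
    """Apply a few conservative layout fixes to raw OCR output."""
    # Rejoin words that were hyphenated across a line break
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Remove leading and trailing spaces on every line
    text = "\n".join(line.strip() for line in text.splitlines())
    # Reduce three or more consecutive newlines to a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Note that the hyphen rule also joins genuinely hyphenated compounds that happen to break at the end of a line, so it is worth skimming the result afterwards.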