I’ve started using Tesseract to add an optical character recognition (OCR) layer to PDFs. What follows are my notes on getting this to a reasonable state, and a word of warning about Preview on Sierra.
Background
I’ve written about my collection of articles before. They’re all PDFs and indexed in Zotero. Their source various: some are distillations of digital documents, some are scans from print or microfilm. Some, but especially the latter, haven’t been through OCR so they aren’t searchable. That’s not a big deal in a 1-2 page article, but in longer works it’s obnoxious. Adding OCR also exposes the text to Spotlight, OS X’s internal search.
Tesseract
Tesseract is the de facto standard for open source OCR solutions. It’s installable via homebrew. That’s easy. I discovered pretty quickly that Tesseract doesn’t work with PDFs out of the box. I wasn’t averse to building a script that pulled apart the PDF, converted each page to a TIFF, did the OCR, and then put it all back together, but I figured someone had crafted a more elegant solution.
pdfsandwich
Per Cornelius’ excellent tutorial, the more elegant solution is Tobias Elze’s pdfsandwich, which wraps all that, plus plenty of additional functionality which I don’t fully grok yet. The code is available via SVN: svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich
You’re going to need a few more dependencies from homebrew, in addition to tesseract:
- poppler: provides the pdfunite package
- gawk: used during compilation
- ocaml: used during compilation
- unpaper: handles OCR preprocessing and other things
Those are all the dependencies I needed to install; a bare system probably needs more.
Examples
You invoke pdfsandwich from the command line. pdfsandwich —help
gives a rundown of the many options. This is the most basic invocation:
pdfsandwich -lang eng <filename>.pdf
This will process the given file and output a new file with the OCR layer included. By default the new file is named <filename>_ocr.pdf
. If the file has images you’ll want an additional argument to ensure they’re overlaid correctly. I used the -gray
flag for grayscale documents and -rgb
for color. Note that while pdfsandwich use multiple threads for processing a large document (50+ pages) will take at least several minutes on a pretty fast MacBook.
By default the command spits out quite a bit of information which you’re free to ignore. Once it’s done you’ll have a new PDF and the text is searchable.
Beware Preview
There’s a pretty nasty bug on OS X which, depending on who you talk to, was introduced in Sierra or has been around for years. In a nutshell, saving a PDF in Preview can corrupt the OCR layer. The text will no longer be searchable nor copyable. Despite reports to the contrary it’s not fixed (at least not for me) in 10.12.3. I’ve taken the precaution of locking the PDFs for now. I’m sure I’ll forget in two months time, but it’ll solve the immediate problem.