I’ve started using Tesseract to add an optical character recognition (OCR) layer to PDFs. What follows are my notes on getting this to a reasonable state, and a word of warning about Preview on Sierra.
Background
I’ve written about my collection of articles before. They’re all PDFs and indexed in Zotero. Their source various: some are distillations of digital documents, some are scans from print or microfilm. Some, but especially the latter, haven’t been through OCR so they aren’t searchable. That’s not a big deal in a 1-2 page article, but in longer works it’s obnoxious. Adding OCR also exposes the text to Spotlight, OS X’s internal search.