script to split and ocr a large pdf with acrobat

I decided to write a Python script inspired by Bill Turkel’s recent post on bursting large PDFs into small chunks for more efficient searching, and just as an exercise. I have a bunch of PDFs downloaded from Google Books, which don’t have a text layer. Many of these are very large (600+ pages), and my machine really bogs down when OCRing such a giant file, so I’ve put off doing it for a long time. The idea here is to split these large files into single-page PDFs with a regular naming scheme, save them in a new or designated folder, and then send them to Acrobat 9 for OCR. I have found, by the way, that Acrobat’s OCR with ClearScan turned on is the best out there, or at least the best I’ve used. I also wanted to do it all in Python. Using Acrobat requires scripting an application, which OS X usually does through AppleScript.

I’ve never used AppleScript, so first I went looking for a way to script applications using Python. Turns out you can, using appscript, but only if you can also write the AppleScript to do it. Appscript offers libraries for Objective-C, Ruby, and Python, and also two helper apps, ASTranslate and ASDictionary. ASTranslate lets you type in AppleScript code and have it translated into ObjC, Ruby, or Python code that you can copy and paste. Turns out that Acrobat has minimal scripting abilities with AppleScript, so it requires lots of GUI/keystroke scripting, which is available through an AppleScript app, “System Events.” Even so, this kind of scripting does not handle interactive events like popup windows very well at all. So, figuring this out was a huge pain in the ass. For example, if you try to OCR a PDF that already has renderable text, Acrobat tells you it can’t/won’t OCR through a popup window that has to be dismissed with an “OK” button. This posed a problem for walking through a series of PDFs, some of which already have renderable text: the popup would break the program.

Given that the readers of this blog aren’t programmers, I’ll walk through the steps in putting the script together. What were the steps I wanted to accomplish? 1. Select a PDF to split; 2. Select a (new) folder to put the split files in; 3. Split the PDF; 4. Send each of the new files to Acrobat for OCR if needed (with a test to see if it’s needed); 5. Save the OCRed files. Didn’t sound all that hard, but working with binary files always requires a little special sauce. As is frequently the case with Python, there are some modules out there to help. So, in addition to appscript, I also needed pyPdf to split and, it turned out, to test for extractable text. And of course, as always, I use easygui for file and other dialogues. In addition to these external libraries, I needed the os, time, and sys modules from the standard library. (Here’s this first script in full.)

So, first up, import the needed modules:

import easygui, os, time, sys
from appscript import *
from pyPdf import PdfFileWriter, PdfFileReader

Next, pick a PDF to split, and a folder for the new files. easygui returns a full file path, so I also needed to get the simple file name from that path using os.path.basename. And we need to set the working directory for the script to the newly chosen or created directory:

originalPDF = easygui.fileopenbox(msg="Pick a PDF to split and OCR.", filetypes=["*.pdf"])
destDir = easygui.diropenbox(msg="Pick a folder for the new files.")

fileNameBase = os.path.basename(originalPDF)

os.chdir(destDir)

Next, use pyPdf to split the PDF into a series of single-page PDFs. It does this by loading the PDF, getting each page, and writing that page to a new file, so we use a for loop to iterate through the pages:

inputpdf = PdfFileReader(file(originalPDF, "rb"))
for i in xrange(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    outputStream = file(fileNameBase[:-4]+"%s.pdf" % i, "wb")
    output.write(outputStream)
    outputStream.close()
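
One wrinkle with the naming in the loop above: slicing off the last four characters assumes a three-letter extension, and unpadded page numbers sort as 0, 1, 10, 100 in a directory listing. Here is a minimal sketch of an alternative naming helper, using only the standard library. The function name page_filename, the underscore separator, and the three-digit minimum width are assumptions made for illustration, not part of the original script.

```python
import os


def page_filename(original_path, page_index, total_pages):
    """Build a zero-padded file name for one page of a split PDF.

    page_index is zero-based, as in the loop above; the name uses a
    one-based page number so it matches what a PDF viewer shows.
    """
    # splitext drops the extension whatever its length or case,
    # unlike slicing off the last four characters.
    stem = os.path.splitext(os.path.basename(original_path))[0]
    # Pad to at least three digits so that page 10 sorts after page 9.
    width = max(3, len(str(total_pages)))
    return "%s_%0*d.pdf" % (stem, width, page_index + 1)


if __name__ == "__main__":
    # Hypothetical path, just to show the names that come out.
    for i in range(3):
        print(page_filename("/Users/me/books/history.pdf", i, 612))
```

In the split loop, the result would take the place of fileNameBase[:-4]+"%s.pdf" % i when opening the output file.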

Next, I put up a box to ask if I want the new PDFs to be OCRed. If they came from, say, JSTOR, I know they already have a text layer and there’s no need to bother:

if easygui.ynbox(msg="Does this pdf need OCR?"):
	pass
else:
	sys.exit(0)

Click on “No”, and the script exits. If not, it moves on. At this point, it’s time to get a list of the new files, test each one to see if it has extractable text, and if not, pass it on. Why test the files? Well, a Google Books PDF comes with a page or two of renderable text that Google appends to the front of the file. This means that if I simply walked through the directory, Acrobat’s error message would trip up the program and it would never move on to the next file. Also, OS X adds a hidden .DS_Store file to a directory for Finder, and Acrobat can’t exactly open that file if the script passes it along. So, we need two functions: 1. a directory listing without the .DS_Store file; and 2. a test for extractable text:

def mylistdir(directory):
    filelist = os.listdir(directory)
    return [x for x in filelist
            if not (x.startswith('.'))]

def tryExtract(path):
	result = None
	try:
		content = ''
		pdf = PdfFileReader(file(path, 'rb'))
		content += pdf.getPage(0).extractText()
		if content == '':
			result = None
		else:
			result = 1
	except IOError:
		pass
	return result	
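
The dotfile filter could also be tightened so that only PDFs ever reach Acrobat. This is a rough sketch using only the standard library. list_pdfs is a hypothetical name, and returning the names sorted is a choice made here; sorting gives page order only when the numbers in the file names are padded to equal width.

```python
import os


def list_pdfs(directory):
    """Return the PDF file names in directory, sorted, skipping dotfiles."""
    names = []
    for name in os.listdir(directory):
        if name.startswith("."):
            continue  # hidden files such as .DS_Store
        if not name.lower().endswith(".pdf"):
            continue  # anything that is not a PDF
        if not os.path.isfile(os.path.join(directory, name)):
            continue  # sub-folders, even ones named like a PDF
        names.append(name)
    return sorted(names)
```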

tryExtract takes a file path, opens the PDF, and tries to extract the text of its first page. If nothing is extracted, it returns None, which in the next snippet tells the program to pass the PDF to Acrobat for OCR. Next, we use appscript to open the PDF in Acrobat, OCR it, save and close it, and then move on to the next one. If a file doesn’t need OCR because it already has extractable text, a dialogue box tells me. When all the files are finished, another box lets me know and the program exits:

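# get a reference to Acrobat before scripting it
acrobat = app(u'Adobe Acrobat Pro')
acrobat.activate()

pdfList = mylistdir(destDir)
for newpdf in pdfList:
	pagePath = os.path.join(destDir, newpdf)
	if tryExtract(pagePath) is None:
		# no extractable text, so open the page in Acrobat
		acrobat.open(pagePath)
		time.sleep(1)
		# click Document > OCR Text Recognition > Recognize Text Using OCR...
		app(u'System Events').processes[u'Acrobat'].menu_bars[1].menus[u'Document'].menu_items[u'OCR Text Recognition'].menus[1].menu_items[u'Recognize Text Using OCR...'].click()
		time.sleep(1)
		# press Return to accept the OCR dialogue's defaults
		app(u'System Events').processes[u'Acrobat'].keystroke(u'\r')
#		time.sleep(5)
		# Command-S to save, then close the document
		app(u'System Events').processes[u'Acrobat'].keystroke(u's', using=[k.command_down])
		acrobat.documents[1].close()
		time.sleep(1)
	else:
		easygui.msgbox(msg="No OCR needed.")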

easygui.msgbox(msg="All done!")
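
In the loop above, a dialogue box pops up for every page that already has text and waits for a click, which gets in the way on a long book. One possible variation, sketched here, is to collect those pages and report them once at the end. The function name ocr_where_needed and its two callable parameters are hypothetical: has_text stands in for a test like tryExtract, and run_ocr for the Acrobat steps.

```python
def ocr_where_needed(paths, has_text, run_ocr):
    """Run OCR on every path without a text layer; return what happened.

    has_text and run_ocr are callables supplied by the caller
    (hypothetical stand-ins for tryExtract and the Acrobat GUI steps).
    Returns a pair of lists: (paths sent to OCR, paths skipped).
    """
    processed, skipped = [], []
    for path in paths:
        if has_text(path):
            skipped.append(path)  # note it and keep going, no dialogue
        else:
            run_ocr(path)
            processed.append(path)
    return processed, skipped
```

The skipped list could then go into a single easygui.msgbox after the loop, instead of one box per page.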

Some of those lines of code are ugly. Why? That’s the appscript translation of AppleScript, executing GUI commands. What you see on screen is literally Adobe Acrobat opening files, OCRing, saving, and closing. It is automated, but still pretty silly. So, even though I accomplished the task I had set out to do (after hours of learning enough AppleScript to do the same thing just using AppleScript, and after tons of Google searches and even a Stack Overflow question), I don’t really even like it.
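
The fixed time.sleep(1) pauses in the loop are a guess at how long Acrobat needs, and OCR time varies from page to page. A hedged alternative is to poll for some sign that a step has finished. The helper below is a generic sketch using only the standard library. wait_until is a hypothetical name, and what to test for (for instance, the file’s modification time changing after the save) is an assumption that would need checking against how Acrobat actually behaves.

```python
import time


def wait_until(condition, timeout=60.0, interval=0.5):
    """Poll condition() until it returns a true value or timeout seconds pass.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.time() + timeout
    while True:
        if condition():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)
```

A False return gives the script a chance to stop or skip a page rather than sending keystrokes to a window that is not ready.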

After finishing, I realized that Acrobat has a batch OCR dialogue. As with other aspects of Acrobat, I couldn’t fully script it. But it does offer the advantage of removing the need to test each PDF for extractable text. So, in the final version I dispensed with all that, which is ridiculous because I wasted tons of time figuring this all out. (I’m not a very good programmer, I think.) The shortened routine, which relies on Acrobat’s GUI for choosing the folder of files to OCR, is this:

#!/usr/bin/env python
# encoding: utf-8
"""
splitOCRpdf.py

Created by Chad Black on 2011-04-04.

"""

import easygui, os, time, sys
from appscript import *
from pyPdf import PdfFileWriter, PdfFileReader


# choose a pdf to split and OCR, along with a destination folder
# for the resulting files.
originalPDF = easygui.fileopenbox(msg="Pick a PDF to split and OCR.", filetypes=["*.pdf"])
destDir = easygui.diropenbox(msg="Pick a folder for the new files.")

# extract filename of the original file
fileNameBase = os.path.basename(originalPDF)

# change the working directory to the destination folder
os.chdir(destDir)

# split the pdf into single page pdfs, save to destination directory.
inputpdf = PdfFileReader(file(originalPDF, "rb"))
for i in xrange(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    outputStream = file(fileNameBase[:-4]+"%s.pdf" % i, "wb")
    output.write(outputStream)
    outputStream.close()

# ask if the split pdf needs to be OCR'd, if not the script exits.
if easygui.ynbox(msg="Does this pdf need OCR?"):
	pass
else:
	sys.exit(0)

# send the folder to acrobat for OCR batch
acrobat = app(u'Adobe Acrobat Pro')
acrobat.activate()
app(u'System Events').processes[u'Acrobat'].menu_bars[1].menus[u'Document'].menu_items[u'OCR Text Recognition'].menus[1].menu_items[u'Recognize Text in Multiple Files Using OCR...'].click()

Feel free to use it if you’re on a Mac and need to do that sort of thing. In fact, you can even use Automator to make this (or any other script you run from the shell) into a clickable application, and launch it directly from your Applications folder.

About

Associate Professor of Early Latin America Department of History University of Tennessee-Knoxville

Posted in Digital History, programming
Chad Black

About:
I, your humble contributor, am Chad Black. You can also find me on the web here.