OCRMYPDF - handy when automating the archiving of downloaded papers

Aug 10

Written By Ganesalingam Narenthiran FEBNS FRCS(SN)

Disclaimer: Most of the code provided below was sourced from the web. However, there is currently no satisfactory solution on the web for automating the conversion of a .pdf into a searchable .pdf with ocrmypdf and Hazel. This post provides a working solution to that problem.

If you use ocrmypdf to convert a .pdf file into a searchable .pdf through Hazel (or a folder action created with automation), you might find that the script openly available on the web does not work. This is because the shell script needs to include the correct path for TMPDIR on your computer. To find the correct path, run the following command in your terminal:

-% echo $TMPDIR
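As an alternative to copying the value by hand, a script could look the directory up when it runs. The following is only a minimal sketch and has not been verified inside Hazel. It assumes that `getconf DARWIN_USER_TEMP_DIR` prints the per-user temporary directory on macOS, and the function name `resolve_tmpdir` is made up for illustration.

```shell
# Sketch: resolve a temporary directory without hard-coding it.
# Assumption: on macOS, `getconf DARWIN_USER_TEMP_DIR` prints the per-user
# temporary directory; where that call fails, /tmp/ is used instead.
resolve_tmpdir() {
    local dir
    if [ -n "$TMPDIR" ]; then
        # TMPDIR is already set (for example in an interactive terminal)
        printf '%s\n' "$TMPDIR"
    elif dir=$(getconf DARWIN_USER_TEMP_DIR 2>/dev/null) && [ -n "$dir" ]; then
        printf '%s\n' "$dir"
    else
        printf '%s\n' "/tmp/"
    fi
}

TMPDIR=$(resolve_tmpdir)
export TMPDIR
```

If this does not work in your setup, the hard-coded path from `echo $TMPDIR` remains the approach described in this post.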

To find which shell you are working in, use the following command:

-% echo $0

(This is better than $SHELL: $0 reports the shell that is actually running, whereas $SHELL reports your default login shell.)

On my computer, the shell is /bin/zsh

If you are using Hazel, enter the shell in the top box of the interface where you enter your shell code.

In the main window, you could enter the following code:

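# Make the Homebrew tools (ocrmypdf, tag) visible to the script

PATH=$PATH:/opt/homebrew/bin

export PATH

# Replace this with the output of echo $TMPDIR on your own computer

export TMPDIR="/var/folders/f0/jpk6g96n3bj5ctfhpfmwc1_80000gn/T/"

# Hazel passes the full path of the matched file as $1; keep only its name

filename=$(basename "$1")

# Quoted because the path contains a space

converting_directory="/Users/username/Library/Mobile Documents/com~apple~CloudDocs/reprint-adobe/"

output_file="$converting_directory${filename%.pdf}_ocr.pdf"

if ocrmypdf --skip-text "$1" "$output_file"

then

tag -a "pdf/a,scan" "$output_file"

rm "$1"

echo "OCR succeeded"

else

echo "OCR failed"

fi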

You would need to change converting_directory to reflect the path to the folder where you want the folder action. The above script converts a .pdf into a searchable .pdf, appends ‘_ocr’ to the name of the new file, tags it, and deletes the original file.

To stop Hazel from running in a loop, executing the same routine on the same file, you need to create a condition that halts the process once the conversion has happened. I managed this by adding a condition that halts the procedure if the name of the file has ‘ocr’ in it. You can also use tags to create appropriate condition logic for this purpose.
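The naming step, deriving the new file name from the original, can be tried on its own in a terminal before wiring it into Hazel. This is a minimal sketch: the function name `ocr_output_path` and the example paths are hypothetical and chosen only for illustration.

```shell
# Sketch: build the output path for a converted file.
# $1: full path of the source PDF
# $2: destination folder (with or without a trailing slash)
ocr_output_path() {
    local name
    name=$(basename "$1")            # drop the directory part
    # ${2%/} removes one trailing slash; ${name%.pdf} removes the extension
    printf '%s/%s_ocr.pdf\n' "${2%/}" "${name%.pdf}"
}

ocr_output_path "/Users/username/Downloads/paper one.pdf" "/Users/username/reprints/"
# prints /Users/username/reprints/paper one_ocr.pdf
```

Quoting every expansion is what keeps file names and folders containing spaces intact.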
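The same guard could also live inside the shell script as a second line of defence, in case the Hazel condition is ever removed. The sketch below is an assumption about how one might do that, not part of the rule described above; the function name `needs_ocr` is hypothetical.

```shell
# Sketch: decide from the file name alone whether a file still needs OCR.
# Returns 0 (true) for an unprocessed PDF, 1 (false) otherwise.
needs_ocr() {
    case "$(basename "$1")" in
        *_ocr.pdf) return 1 ;;   # already converted, skip
        *.pdf)     return 0 ;;   # a PDF that has not been processed yet
        *)         return 1 ;;   # not a PDF at all
    esac
}

for f in "paper.pdf" "paper_ocr.pdf"; do
    if needs_ocr "$f"; then
        echo "$f: convert"
    else
        echo "$f: skip"
    fi
done
```

In the Hazel script, the ocrmypdf call would then sit inside an `if needs_ocr "$1"` block.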

Nitro PDFpen Pro

An easier way to achieve the same end is to use Nitro PDFpen Pro, although the software costs about £170. If you use it, the AppleScript code to include in the Hazel rule would be:

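tell application "PDFpenPro"

-- theFile is the matched file, supplied by Hazel

open theFile as alias

tell document 1

ocr

-- wait until the OCR pass has finished before saving

repeat while performing ocr

delay 1

end repeat

delay 1

close with saving

end tell

end tell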

tell application "PDFpenPro" worked fine for me. If it does not work for you, you might wish to try tell application "NitroPDFpenpro".

(Nitro bought PDFpenPro from Smile Software.)
