OCRmyPDF - handy when automating the archiving of downloaded papers
Disclaimer: Most of the code below was sourced from the web. However, at the time of writing I could not find a satisfactory solution on the web for automating the conversion of a .pdf to a searchable .pdf with ocrmypdf and Hazel. This post provides a working solution to that problem.
If you use ocrmypdf to convert a .pdf file into a searchable .pdf through Hazel (or a folder action created with automation), you might find that the script openly available on the web does not work. This is because the shell script needs to set the correct path for TMPDIR on your computer. To find the correct path for TMPDIR, run the following command in your terminal:
% echo $TMPDIR
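If you would rather not paste a hard-coded TMPDIR path into the script, here is a minimal sketch of an alternative. It assumes that `getconf DARWIN_USER_TEMP_DIR` returns the per-user temporary directory on your version of macOS; I have not tested it inside Hazel, so treat it as a starting point rather than a guaranteed fix.

```shell
# Keep TMPDIR if the environment already provides one.
if [ -z "${TMPDIR:-}" ]; then
    # Assumption: macOS exposes the per-user temp directory through getconf.
    # On other systems the call fails and we fall back to /tmp.
    TMPDIR=$(getconf DARWIN_USER_TEMP_DIR 2>/dev/null) || TMPDIR=/tmp
    # Guard against an empty answer as well.
    [ -n "$TMPDIR" ] || TMPDIR=/tmp
fi
export TMPDIR
echo "Using TMPDIR=$TMPDIR"
```

If this works on your machine, these lines could stand in for the hard-coded `export TMPDIR=...` line in the script.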
To find which shell you are working in, use the following command:
% echo $0
(this is more reliable than $SHELL, which reports your login shell rather than the shell you are currently running)
On my computer, the shell is /bin/zsh
If you are using Hazel, enter this shell in the box at the top of the window where you enter your shell script.
In the main window, you could enter the following code:
PATH=$PATH:/opt/homebrew/bin
export PATH
# Replace with the value that echo $TMPDIR prints on your machine
export TMPDIR="/var/folders/f0/jpk6g96n3bj5ctfhpfmwc1_80000gn/T/"
# File name without its directory (Hazel passes the full path as $1)
filename=$(basename "$1")
# Quoted because the path contains a space
converting_directory="/Users/username/Library/Mobile Documents/com~apple~CloudDocs/reprint-adobe/"
output="$converting_directory${filename%.pdf}_ocr.pdf"
if ocrmypdf --skip-text "$1" "$output"
then
    tag -a "pdf/a,scan" "$output"
    rm "$1"
    echo "OCR succeeded"
else
    echo "OCR failed"
fi
You will need to change converting_directory to the path of the folder you are using for the folder action, and TMPDIR to the value reported on your own computer. The script converts a .pdf into a searchable .pdf, adds '_ocr' to the name of the new file, and deletes the original file.
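To make the renaming step easier to check on its own, here is a small sketch of how the output name is built. The function name ocr_output_path and the example paths are my own, made up for illustration; they are not part of ocrmypdf or Hazel.

```shell
# ocr_output_path INPUT_PDF TARGET_DIR
# Prints TARGET_DIR/<input name without .pdf>_ocr.pdf
ocr_output_path() {
    # Drop the directory part of the input path.
    ocr_name=$(basename "$1")
    # Strip one trailing slash from the target directory, strip the
    # .pdf suffix from the file name, then append _ocr.pdf.
    printf '%s/%s_ocr.pdf\n' "${2%/}" "${ocr_name%.pdf}"
}

ocr_output_path "/Users/username/Downloads/My Paper.pdf" "/Users/username/reprints/"
# prints: /Users/username/reprints/My Paper_ocr.pdf
```

Running this in a terminal with a few of your own file names is a quick way to confirm where the converted file will land before letting Hazel delete any originals.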
To stop Hazel from running in a loop, executing the same routine on the same file, you need to create a condition that halts the process once the conversion has happened. I did this by adding a condition to the rule that halts the procedure if the name of the file has 'ocr' in it. You can also use tags to create appropriate condition logic for this purpose.
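The same check can also be done inside the shell script as an extra safeguard. This is only a sketch: the function name is_ocr_output is my own, and it assumes converted files always end in _ocr.pdf, as they do with the script above.

```shell
# is_ocr_output FILE
# Succeeds when the file name ends in _ocr.pdf, i.e. it was already converted.
is_ocr_output() {
    case "$(basename "$1")" in
        *_ocr.pdf) return 0 ;;
        *)         return 1 ;;
    esac
}

# Example: decide what to do with two file names.
for f in "paper.pdf" "paper_ocr.pdf"; do
    if is_ocr_output "$f"; then
        echo "skip $f"
    else
        echo "convert $f"
    fi
done
```

In a Hazel script you would call is_ocr_output "$1" at the top and stop early when it succeeds. The condition in the Hazel rule itself is still the main protection against the loop.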
Nitro PDFpenPro
An easier way to achieve the same result is to use Nitro PDFpenPro. However, the software costs about £170 to buy. If you use Nitro PDFpenPro, the AppleScript code to include in the Hazel rule would be:
tell application "PDFpenPro"
open theFile as alias
tell document 1
ocr
repeat while performing ocr
delay 1
end repeat
delay 1
close with saving
end tell
end tell
tell application "PDFpenPro" worked fine for me. If it does not work for you, you might wish to try tell application "NitroPDFpenpro" instead.
(Nitro bought PDFpenPro from Smile Software.)