Yesterday, I wanted to convert over a hundred PDFs to plain text so that copying and pasting during product updates for a client would be quicker and easier. After a short search I found a simple and effective way to do it, and I’m very pleased with how it handles line breaks and paragraphs.
The command-line utility that does the conversion is called pdftotext, and it originated in the Xpdf suite. It comes with Ubuntu (which I’m using for my workstation; current releases ship it in the poppler-utils package, which is derived from Xpdf) as well as many other GNU/Linux distributions.
Using it to convert a single PDF to text is simply a case of giving it the filename of a PDF:
$ pdftotext catalogue-page-1.pdf
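With a single argument, pdftotext writes the text to a file of the same name with a .txt extension. It also accepts an optional second argument naming the output file, and its -layout option tries to preserve the physical layout of the page, which may suit tables and multi-column pages better than the default. A minimal sketch of both, where the function name pdf_to_text is purely illustrative and not part of pdftotext:

```shell
# pdf_to_text is a hypothetical helper, not something pdftotext provides.
pdf_to_text() {
    # ${1%.*} strips the last extension, so "page.pdf" and "page.PDF"
    # both produce "page.txt" alongside the original file.
    pdftotext -layout "$1" "${1%.*}.txt"
}

# Usage:
# pdf_to_text catalogue-page-1.pdf
```

Whether -layout reads better than the default depends on the document, so it is worth trying both on a sample page first.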
However, pdftotext accepts only one PDF at a time (a second filename is treated as the output file), so getting it to process a whole directory of PDFs needed another command:
$ find ~/new-products/ -iname "*.pdf" -exec pdftotext {} \;
“Riad” posted this command, and here’s his explanation:
1. -iname: the “i” stands for “ignore case”, which means find will match .pdf as well as .PDF, .Pdf …
2. The -exec switch runs a program on each matching file. The name of the file is inserted into the command by a pair of curly braces ({}) and the command ends with an escaped semicolon. (If the semicolon is not escaped, the shell interprets it as the end of the find command instead.)
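One thing the command above does not do is say which files failed to convert. Here is a rough sketch of one way to add that, assuming a POSIX sh is available; the function name convert_pdfs is hypothetical. It uses the "{} +" form of -exec, which passes many filenames to a single command, and wraps pdftotext in a small loop so that it still receives one PDF per call:

```shell
# convert_pdfs is a hypothetical helper name, not part of find or pdftotext.
convert_pdfs() {
    # "{} +" hands the matching files to the inline sh script in batches.
    # The loop then runs pdftotext once per file and prints a message on
    # stderr for any file that pdftotext rejects.
    find "$1" -type f -iname '*.pdf' -exec sh -c '
        for pdf in "$@"; do
            pdftotext "$pdf" || echo "failed: $pdf" >&2
        done
    ' sh {} +
}

# Usage:
# convert_pdfs ~/new-products/
```

Passing the filenames as arguments to the inline script, rather than pasting {} into the script text, keeps names containing spaces or quotes intact.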