It turns out that pdfedit is pretty good at extracting text from PDF files. Here is a script I wrote to do that in batch mode.
    #!/bin/sh
    # Print the text from a pdf document on stdout
    # Copyright: (c) 2006-2010 PDFedit team <http://sourceforge.net/projects/pdfedit>
    # Copyright: (c) 2010, David Bremner <david@tethera.net>
    # Licensed under version 2 or later of the GNU GPL

    set -e

    if [ $# -lt 1 ]; then
        echo "usage: $0 file [pageSep]"
        exit 1
    fi

    # Quote the arguments so file names with spaces survive;
    # pass pageSep through only when it was actually given.
    /usr/bin/pdfedit -console -eval '
    function onConsoleStart() {
        var inName = takeParameter();
        var pageSep = takeParameter();
        var doc = loadPdf(inName,false);
        pages=doc.getPageCount();
        for (i=1;i<=pages;i++) {
            pg=doc.getPage(i);
            text=pg.getText();
            print(text);
            print("\n");
            print(pageSep);
        }
    }
    ' "$1" ${2+"$2"}
Yeah, I wish #!/usr/bin/pdfedit worked too. Thanks to Aaron M Ucko for pointing out that
-eval could replace the use of a temporary file.
Oh, and pdfedit will be even better when the authors release a new version that fixes the truncation of wide text.
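
For running the script above over a whole directory, something like the following wrapper might do. It is only a sketch: `pdf2text.sh` is a hypothetical name for wherever the script is saved, `extract_all` is a made-up helper, and the `----` page separator is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: run the extractor over every PDF in a directory.
# EXTRACT is an assumed location for the script from this post.
EXTRACT=${EXTRACT:-./pdf2text.sh}

extract_all() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$pdf" ] || continue
        # Write foo.txt next to foo.pdf, with "----" after each page.
        "$EXTRACT" "$pdf" "----" > "${pdf%.pdf}.txt"
    done
}
```

Calling `extract_all ~/papers` would then leave one `.txt` file beside each PDF in that directory.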