UNB/ CS/ David Bremner/ blog/ posts/ Extracting text from pdf with pdfedit

It turns out that pdfedit is pretty good at extracting text from pdf files. Here is a script I wrote to do that in batch mode.

#!/bin/sh
# Print the text from a pdf document on stdout
# Copyright: (c) 2006-2010 PDFedit team  <http://sourceforge.net/projects/pdfedit>
# Copyright: (c) 2010, David Bremner <david@tethera.net>
# Licensed under version 2 or later of the GNU GPL

set -e

if [ $# -lt 1 ]; then
    echo usage: $0 file [pageSep]
    exit 1
fi

#!/bin/sh
# Print the text from a pdf document on stdout
# Copyright: © 2006-2010 PDFedit team  <http://sourceforge.net/projects/pdfedit>
# Copyright: © 2010, David Bremner <david@tethera.net>
# Licensed under version 2 or later of the GNU GPL

set -e

if [ $# -lt 1 ]; then
    echo usage: $0 file [pageSep]
    exit 1
fi

/usr/bin/pdfedit -console -eval '
function onConsoleStart() {
    var inName = takeParameter();
    var pageSep = takeParameter();
    var doc = loadPdf(inName,false);

    pages=doc.getPageCount();
    for (i=1;i<=pages;i++) {
        pg=doc.getPage(i);
        text=pg.getText();  
        print(text);
        print("\n");
        print(pageSep);
    }
}
' $1 $2

Yeah, I wish #!/usr/bin/pdfedit worked too. Thanks to Aaron M Ucko for pointing out that -eval could replace the use of a temporary file.

Oh, and pdfedit will be even better when the authors release a new version that fixes truncating wide text