Diff of two pdf files?

I'm looking for a good program to show me the differences between two similar PDF files. In particular, I'm looking for something that doesn't just run diff on an ASCII version of the files (produced with "pdftotext"), which is all that pdfdiff.py does.


6 Answers

You can use DiffPDF for this. From the description:

DiffPDF is used to compare two PDF files. By default the comparison is of the text on each pair of pages, but comparing the appearance of pages is also supported (for example, if a diagram is changed or a paragraph reformatted). It is also possible to compare particular pages or page ranges. For example, if there are two versions of a PDF file, one with pages 1-12 and the other with pages 1-13 because of an extra page having been added as page 4, they can be compared by specifying two page ranges, 1-12 for the first and 1-3, 5-13 for the second. This will make DiffPDF compare pages in the pairs (1, 1), (2, 2), (3, 3), (4, 5), (5, 6), and so on, to (12, 13).
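
To make the page-range pairing in that description concrete, here is a minimal bash sketch that expands two range specs and pairs them up position by position. It is only an illustration of the pairing rule, not DiffPDF's own code; the function names expand_ranges and pair_pages are made up for this example.

```shell
#!/bin/bash
# Hypothetical helpers illustrating the pairing rule; not part of DiffPDF.

# Expand a spec such as "1-3, 5-13" into one page number per line.
expand_ranges() {
  local spec=${1// /}   # drop spaces so "1-3, 5-13" also works
  local part first last
  local IFS=,
  for part in $spec; do
    first=${part%-*}    # "5-13" -> 5, a single page "4" stays 4
    last=${part#*-}     # "5-13" -> 13, a single page "4" stays 4
    seq "$first" "$last"
  done
}

# Pair the n-th page of the first spec with the n-th page of the second.
pair_pages() {
  paste -d' ' <(expand_ranges "$1") <(expand_ranges "$2")
}
```

Calling `pair_pages "1-12" "1-3, 5-13"` prints twelve lines, starting with `1 1` and ending with `12 13`, where the fourth line is `4 5`, matching the pairs listed in the description.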


I just figured out a hack to make DiffPDF (the program suggested by @qbi) usable for more than minor changes. I concatenate all the pages of each PDF into one long scroll using pdfjam and then compare the two scrolls. It works even when large sections are removed or inserted!

Here is a bash script that does the job:

#!/bin/bash
#
# Compare two PDF files.
# Dependencies:
# - pdfinfo (xpdf)
# - pdfjam (texlive-extra-utils)
# - diffpdf
#
MAX_HEIGHT=15840 #The maximum height of a page (in points), limited by pdfjam.
TMPFILE1=$(mktemp /tmp/XXXXXX.pdf)
TMPFILE2=$(mktemp /tmp/XXXXXX.pdf)
usage="usage: scrolldiff -h FILE1.pdf FILE2.pdf
  -h print this message
v0.0"
while getopts "h" OPTIONS ; do
  case ${OPTIONS} in
    h|-help) echo "${usage}"; exit;;
  esac
done
shift $(($OPTIND - 1))
if [ -z "$1" ] || [ -z "$2" ] || [ ! -f "$1" ] || [ ! -f "$2" ]
then
  echo "ERROR: input files do not exist."
  echo
  echo "$usage"
  exit
fi
#Get the number of pages:
pages1=$( pdfinfo "$1" | grep 'Pages' - | awk '{print $2}' )
pages2=$( pdfinfo "$2" | grep 'Pages' - | awk '{print $2}' )
numpages=$pages2
if [ "$pages1" -gt "$pages2" ] #numeric comparison; ">" inside [[ ]] compares strings
then numpages=$pages1
fi
#Get the paper size:
width1=$( pdfinfo "$1" | grep 'Page size' | awk '{print $3}' )
height1=$( pdfinfo "$1" | grep 'Page size' | awk '{print $5}' )
width2=$( pdfinfo "$2" | grep 'Page size' | awk '{print $3}' )
height2=$( pdfinfo "$2" | grep 'Page size' | awk '{print $5}' )
if [ $(bc <<< "$width1 < $width2") -eq 1 ]
then width1=$width2
fi
if [ $(bc <<< "$height1 < $height2") -eq 1 ]
then height1=$height2
fi
height=$( echo "scale=2; $height1 * $numpages" | bc )
if [ $(bc <<< "$MAX_HEIGHT < $height") -eq 1 ]
then height=$MAX_HEIGHT
fi
papersize="${width1}pt,${height}pt"
#Make the scrolls:
pdfj="pdfjam --nup 1x$numpages --papersize {${papersize}} --outfile"
$pdfj "$TMPFILE1" "$1"
$pdfj "$TMPFILE2" "$2"
diffpdf "$TMPFILE1" "$TMPFILE2"
rm -f "$TMPFILE1" "$TMPFILE2"
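
The paper-size arithmetic in the script can be tried on its own. The following is a rough sketch under a few assumptions: it uses awk instead of bc, the function name scroll_papersize is hypothetical, and the page sizes and counts are passed in by hand rather than read from pdfinfo.

```shell
#!/bin/bash
MAX_HEIGHT=15840  # same cap as in the script above, in points

# scroll_papersize W1 H1 PAGES1 W2 H2 PAGES2
# Hypothetical helper: print the "WIDTHpt,HEIGHTpt" value for pdfjam.
scroll_papersize() {
  awk -v w1="$1" -v h1="$2" -v n1="$3" \
      -v w2="$4" -v h2="$5" -v n2="$6" -v max="$MAX_HEIGHT" 'BEGIN {
    # "+ 0" forces numeric comparison, so 10 pages beats 9 pages
    w = (w1 + 0 > w2 + 0) ? w1 + 0 : w2 + 0
    h = (h1 + 0 > h2 + 0) ? h1 + 0 : h2 + 0
    n = (n1 + 0 > n2 + 0) ? n1 + 0 : n2 + 0
    total = h * n                      # height of the whole scroll
    if (total > max + 0) total = max + 0
    printf "%gpt,%gpt\n", w, total
  }'
}
```

For two US Letter documents (612 x 792 pt) with 9 and 10 pages, `scroll_papersize 612 792 9 612 792 10` prints `612pt,7920pt`. With 30 pages the height would be 23760 pt, so it is capped at 15840 pt.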

Even though this doesn't solve the issue directly, here is a nice way to do it all from the command line with few dependencies:

diff <(pdftotext -layout old.pdf /dev/stdout) <(pdftotext -layout new.pdf /dev/stdout)

It works really well for basic pdf comparisons. If you have a newer version of pdftotext you can try -bbox instead of -layout.
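
If you use the one-liner above often, it can be wrapped in a small function. This is just a sketch: the name pdf_text_diff is made up, it assumes bash (for process substitution) and a diff that supports --label, as GNU diff does, and it writes to "-" instead of /dev/stdout, which pdftotext also treats as standard output.

```shell
#!/bin/bash
# pdf_text_diff OLD.pdf NEW.pdf
# Hypothetical helper: unified diff of the text extracted from two PDFs.
# Exit status is that of diff: 0 if the text is identical, 1 if it differs.
pdf_text_diff() {
  diff -u --label "$1" --label "$2" \
    <(pdftotext -layout "$1" -) \
    <(pdftotext -layout "$2" -)
}
```

Usage would be `pdf_text_diff old.pdf new.pdf`. The --label options make the diff header show the PDF names instead of the /dev/fd paths created by the process substitution.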

As far as diffing programs go, I like using diffuse, so the command changes ever so slightly:

diffuse <(pdftotext -layout old.pdf /dev/stdout) <(pdftotext -layout new.pdf /dev/stdout)

Hope that helps.

If you have 2-3 huge PDF files to compare (or EPUB or other formats, see below), you can combine the power of:

  1. calibre (to convert your source to text)

  2. meld (to visually search for the differences between the text files)

  3. parallel (to use all your system cores to speed up)

The script below accepts any of the following input formats: MOBI, LIT, PRC, EPUB, ODT, HTML, CBR, CBZ, RTF, TXT, PDF and LRS.

If not installed, then install meld, calibre and parallel:

#install packages
sudo apt-get -y install meld calibre parallel

To be able to run the script from anywhere on your computer, save the following code in a file named "diffepub" (with no extension) inside the directory "/usr/local/bin".

#!/bin/bash
usage="
*** usage:
diffepub - compare text in two files. Valid format for input files are:
MOBI, LIT, PRC, EPUB, ODT, HTML, CBR, CBZ, RTF, TXT, PDF and LRS.
diffepub -h | FILE1 FILE2
-h print this message
Example:
diffepub my_file1.pdf my_file2.pdf
diffepub my_file1.epub my_file2.epub
v0.2 (added parallel and 3 files processing)
"
#parse command line options
while getopts "h" OPTIONS ; do
  case ${OPTIONS} in
    h|-help) echo "${usage}"; exit;;
  esac
done
shift $(($OPTIND - 1))
#check if first 2 command line arguments are files
if [ -z "$1" ] || [ -z "$2" ] || [ ! -f "$1" ] || [ ! -f "$2" ]
then
  echo "ERROR: input files do not exist."
  echo
  echo "$usage"
  exit
fi
#create temporary files (first & last 10 characters of
# input files w/o extension)
file1=`basename "$1" | sed -r -e '
s/\..*$// #strip file extension
s/(^.{1,10}).*(.{10})/\1__\2/ #take first-last 10 chars
s/$/_XXX.txt/ #add tmp file extension
'`
TMPFILE1=$(mktemp --tmpdir "$file1")
file2=`basename "$2" | sed -r -e '
s/\..*$// #strip file extension
s/(^.{1,10}).*(.{10})/\1__\2/ #take first-last 10 chars
s/$/_XXX.txt/ #add tmp file extension
'`
TMPFILE2=$(mktemp --tmpdir "$file2")
if [ "$#" -gt 2 ]
then
file3=`basename "$3" | sed -r -e '
s/\..*$// #strip file extension
s/(^.{1,10}).*(.{10})/\1__\2/ #take first-last 10 chars
s/$/_XXX.txt/ #add tmp file extension
'`
TMPFILE3=$(mktemp --tmpdir "$file3")
fi
#convert to txt and compare using meld
doit(){
  #to solve __space__ between filenames and parallel
  ebook-convert $1
}
export -f doit
if [ "$#" -gt 2 ]
then
  (parallel doit ::: "$1 $TMPFILE1" \
                     "$2 $TMPFILE2" \
                     "$3 $TMPFILE3" ) &&
  (meld "$TMPFILE1" "$TMPFILE2" "$TMPFILE3")
else
  (parallel doit ::: "$1 $TMPFILE1" \
                     "$2 $TMPFILE2" ) &&
  (meld "$TMPFILE1" "$TMPFILE2")
fi

Make sure the owner is your user and it has execution permissions:

sudo chown $USER:$USER /usr/local/bin/diffepub
sudo chmod 700 /usr/local/bin/diffepub

To test it, just type:

diffepub FILE1 FILE2

I tested it on two revisions of a 1600+ page PDF and it worked perfectly. Because calibre is written in Python for portability, converting both files to text took 10 minutes. Slow, but reliable.
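
The temporary-file naming step in the script (first and last 10 characters of the base name) is the least obvious part, so here is a small sketch that isolates it. The function name short_template is made up for this example, and it uses sed -E with separate -e expressions rather than the commented multi-line sed script, but the three substitutions are the same.

```shell
#!/bin/bash
# short_template FILE
# Hypothetical helper: print the mktemp template derived from FILE's name.
short_template() {
  basename "$1" | sed -E \
    -e 's/\..*$//' \
    -e 's/(^.{1,10}).*(.{10})/\1__\2/' \
    -e 's/$/_XXX.txt/'
}

# short_template /docs/a_very_long_document_name_revision2.pdf
#   prints a_very_lon___revision2_XXX.txt
# short_template short.pdf
#   prints short_XXX.txt (names under 11 characters are kept whole)
```

Note that the first substitution cuts at the first dot, so a name like my.file.v2.pdf is reduced to my before the template suffix is added.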

We've been working on a tool and wanted to chime in.

If you're happy with trying an online tool, we built something at Draftable.com which does what you seem to want - compare two PDF/Word files and show deletions and additions.

Right now, our Desktop version is Windows only; but, we also have an API that we published a few years ago and it has been working very well for people with high volumes or security concerns.

I've prepared an image (link below) so that you can see the kind of output you'd get without needing to visit the site. Feedback greatly appreciated!

Sample comparison

To complement the answer above about diff and diffuse: Meld also works as a graphical comparison tool. Install it with

sudo apt-get install meld

and then compare documents with a command like

meld <(pdftotext -layout old.pdf /dev/stdout) <(pdftotext -layout new.pdf /dev/stdout)

Personally, I like Meld more than Diffuse or KDiff3.
