TITLE: Extracting pages with colour from a PDF
DATE: 2021-10-26
AUTHOR: John L. Godlee
====================================================================


I wanted to print my PhD thesis so I could have a version to
annotate before my viva. The cost at my local copy shop to print a
full colour version of the thesis would have been somewhere around
£60, while a black and white copy cost only about £15. Only the
pages with figures contained any colour, so there was no need to
print the whole document in colour. I wanted a way to automatically
extract the colour pages into a new document, which I could then
print in colour separately.

I created a shell script that uses ghostscript (gs) to find the
colour pages, and pdfjam to extract those pages and create a new
document:

 [ghostscript (gs)]: https://ghostscript.com/
 [pdfjam]: https://github.com/rrthomas/pdfjam

   #!/usr/bin/env sh

   # Extract colour pages from a PDF, then create a new PDF containing
   # only those pages. Useful for saving on printing costs.

   if [ "$#" -ne 2 ]; then
       echo "Usage: $0 <input.pdf> <output.pdf>"
       exit 2
   fi

   if [ ! -f "$1" ]; then
       echo "Input file not found"
       exit 2
   fi

   pages=$(gs -o - -sDEVICE=inkcov "${1}" |
       tail -n +6 |
       sed '/^Page*/N;s/\n//' |
       sed -E '/Page [0-9]+ 0.00000  0.00000  0.00000 / d' |
       grep -Eo '^Page\s[0-9]+' |
       awk '{print $2}' |
       tr '\n' ',' |
       sed 's/,$//g')

   if [ -z "${pages}" ]; then
       echo "File has no colour pages"
       exit 2
   fi

   pdfjam "${1}" "${pages}" -o "${2}" > /dev/null 2>&1

The first part of the script with the if statements simply checks
whether the parameters passed to the script are valid. The script
needs to be fed an existing input file, and an output file name.

The pages variable is built using the inkcov device, which is
available in gs version 9.05 and later. The inkcov device prints
the cyan, magenta, yellow and black ink coverage separately for
each page, so all that needs to be done is to exclude pages with
zero cyan, magenta and yellow coverage, and then format the
remaining page numbers as the comma-separated list that pdfjam
expects. If no colour pages are found then the script exits without
creating a new PDF. Otherwise, pdfjam takes the input filename, the
page list, and the output filename and creates a new PDF document.
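
To make the filtering step easier to follow, here is a minimal
sketch of the same idea written as a single awk program. It is not
the script above, just an illustration: the function name
colour_pages is hypothetical, and the sample coverage values are
invented to mimic the two-line-per-page format that inkcov prints.
A page counts as colour if any of its first three values (cyan,
magenta, yellow) is above zero.

```shell
#!/usr/bin/env sh

# colour_pages (hypothetical helper, not part of the original script):
# read inkcov-style output on stdin and print a comma-separated list
# of the pages with non-zero cyan, magenta or yellow coverage.
colour_pages() {
    awk '
        # A "Page N" line sets the current page number.
        /^Page [0-9]+$/ { page = $2; next }
        # A coverage line has the form " C  M  Y  K CMYK OK".
        $5 == "CMYK" {
            if ($1 + $2 + $3 > 0) {
                list = (list == "" ? "" : list ",") page
            }
        }
        # Print nothing at all if no colour pages were found.
        END { if (list != "") print list }
    '
}

# Invented sample data in the inkcov output format.
sample='Page 1
 0.00000  0.00000  0.00000  0.02230 CMYK OK
Page 2
 0.02360  0.01050  0.00000  0.04120 CMYK OK
Page 3
 0.00000  0.00000  0.00000  0.00000 CMYK OK
Page 4
 0.00000  0.00120  0.00340  0.01000 CMYK OK'

printf '%s\n' "$sample" | colour_pages
```

With the sample data this prints 2,4. Assuming a real document, the
function could be fed from gs directly, for example
gs -q -o - -sDEVICE=inkcov input.pdf | colour_pages, and because it
only matches lines of the two expected forms it should not need the
tail step to skip the gs banner.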