I recently purchased some DRM-free comics from Humble Bundle. Usually they provide CBZ files, but this time they only offered standard PDF. So I decided to learn how to use tools to run the conversion. I’ll document my findings here, so anyone else who wants to do the same can see how.
PDF is a format that is meant to describe to an output device, like a screen or a printer, exactly how a document should look. You can embed an image within these documents, and no matter what printer you use, the page that comes out will look the same (assuming you have enough ink, etc.). However, PDF isn’t so great if you want to work with individual images or blocks of text.
If you want a device like an eReader to display comics in a way that lets you zoom and pan easily, you are better off with a format like CBZ. I also have trouble getting older eReaders to display very large (gigabyte-sized) PDF files, so being able to split them up into smaller CBZ files is helpful. A CBZ file is a zip archive of individual images, named like so: image001.png, image002.png etc.
This is my journey of figuring out how to automate extracting and re-packing the comic books.
Extracting Images from PDFs
I found there is a useful open source tool called pdfimages that can scan for and extract all the images present in a PDF file. The simplest way to use it is:
mkdir output_directory
pdfimages comicbook.pdf output_directory/image
This will extract all of the images into output_directory, named with the pattern image-000.ppm. The zero-padded numbers are useful, as some CBZ readers sort file names as plain text and would place image20 after image2 and before image3. Prefixing numbers with 0s prevents this.
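To see why the padding matters, here is a small sketch that sorts unpadded and padded names as plain text. The file names are made up for illustration, and `sort` only stands in for a reader that orders entries by simple string comparison, which is an assumption about how such readers behave.

```shell
#!/bin/bash
# Hypothetical page names, for illustration only.

# Unpadded numbers: plain text sorting puts image20 between image2 and image3.
unpadded_order() {
  printf '%s\n' image2.png image20.png image3.png | LC_ALL=C sort
}

# Padded numbers: %03d pads to three digits, matching the image-000 pattern.
padded_order() {
  for n in 2 20 3; do
    printf 'image-%03d.png\n' "$n"
  done | LC_ALL=C sort
}

echo "Unpadded:"
unpadded_order
echo "Padded:"
padded_order
```

With the padded names, text order and page order agree, so the pages stay in sequence whichever way the reader sorts them.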
Among the full page images, there were also a number of narrow white rectangle images. I suspect the pdfimages tool was picking up empty page borders accidentally included in the PDF. These were easy enough to filter out by using the imagemagick tool suite to query image widths. I move them to a separate directory rather than deleting them, as I want to make sure I’m not missing anything important.
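# the target directory must exist, otherwise mv would rename each file to "unwanted"
mkdir -p unwanted
for file in image-*; do
    currentWidth=$(identify -format "%[fx:w]" "$file")
    if [ "$currentWidth" -lt 200 ]; then
        mv "$file" unwanted/
    fi
done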
Compressing the Images
PPM and PBM are uncompressed, lossless formats that encode images pixel by pixel. They reproduce exactly the same image as was embedded, but are horrendous for file storage as you end up with huge files. (Individual pages were coming out at a whopping 70MB each).
I decided to do some experimenting to see which formats would be the best to use for my conversion. Most CBZ readers can handle PNG and JPEG images. For manual experimenting I used Irfanview. It has a nice interface for batch converting images, and also has plugins for optimising file sizes in many different formats.
- PNG conversion without any compression reduces the file size per image to 3.8MB
- PNG conversion at normal compression takes the file size per image to 3.5MB
- JPEG conversion at a quality of 95 reduces the file size per image to around 5MB
- JPEG conversion at a quality of 90 reduces the file size per image to around 4MB
PNG is lossless and so gives the best quality, and the file sizes produced are reasonable, but the plugin used by Irfanview, OptiPNG, is very slow at its highest compression setting. At one point processing a single image was taking several minutes to complete 1 of 240 passes and I had to kill it. I settled on keeping the normal level of compression, which can handle an image in about 40 seconds.
I was surprised to see it get PNG images down to a lower file size than JPEG, considering PNG is lossless. Doing some digging I found that the plugin is capable of stripping out colours, which works really well for reducing the size of a black and white image. For full colour images, like the front cover, it performs very poorly, with PNG producing far larger sizes, but I don’t mind that as it’s only 1 page.
Irfanview offers a setting to remove borders from images. This is useful for comic books as they often have wide borders that are a nuisance when using a smaller eReader screen. So I enabled that option.
Oops – Black and White Images
When looking through the converted images I noticed a few where the pdfimages tool had failed to extract the images correctly. The detail was displayed correctly, but the colours for some reason had been inverted (white to black, and vice versa). I needed to find a way of detecting these and correcting them, as having to go through and fix them all myself would be a pain. I decided that by looking at the top-left pixel, I could detect whether it was white (correct) or black (inverted), and with that correct all of the broken images. The imagemagick tool suite is useful for this. The following command can give relevant output:
$ convert image-000.ppm -format '%[pixel:p{0,0}]' info:-
srgb(251,243,206)
$ convert image-056.ppm -format '%[pixel:p{0,0}]' info:-
white
$ convert image-066.ppm -format '%[pixel:p{0,0}]' info:-
convert-im6.q16: unable to open image `image-066.ppm': No such file or directory
A colour pixel, a white pixel, and… oh? All of the inverted images are not showing up, why is that?
I had just made one of the most common mistakes when working with a computer – I assumed everything was behaving consistently.
It turns out that some of the images were not grayscale, but pure black and white. So the pdfimages tool helpfully extracted them using the PBM format, not PPM. This is fine. I had missed that they were using a different file extension as Irfanview can capably gobble any image file you throw at it without raising a fuss. The batch converter has an option for auto-invert, so I turned that on and processed the PPM files and PBM files separately.
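Since the extension is what tripped me up, one way to avoid guessing is to look at the first two bytes of each file. Netpbm files start with a magic number: P1 or P4 for bitmaps (PBM), P2 or P5 for graymaps (PGM), and P3 or P6 for pixmaps (PPM). The sketch below is only an illustration: `pnm_kind` is a hypothetical helper name, and the demo files are tiny hand-made images, not real comic pages.

```shell
#!/bin/bash
# Classify a Netpbm file by its two-byte magic number instead of its extension.
# pnm_kind is a hypothetical helper name, not part of any existing tool.
pnm_kind() {
  case "$(head -c 2 "$1")" in
    P1|P4) echo bitmap ;;   # PBM: pure black and white
    P2|P5) echo graymap ;;  # PGM: grayscale
    P3|P6) echo pixmap ;;   # PPM: full colour
    *)     echo unknown ;;
  esac
}

# Demo with two tiny hand-made files.
tmp=$(mktemp -d)
printf 'P4\n8 1\n\377' > "$tmp/page.pbm"                 # 8x1 bitmap
printf 'P6\n1 1\n255\n\377\377\377' > "$tmp/page.ppm"    # 1x1 white pixel
for f in "$tmp"/page.*; do
  echo "${f##*/}: $(pnm_kind "$f")"
done
rm -r "$tmp"
```

A check like this could decide which files need inverting, rather than relying on the file name.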
Unfortunately, it seems that command line utilities do not favour these kinds of images. Every time I try to process one it throws me this error:
convert -trim -negate image-030.pbm image-030.png
convert-im6.q16: cache resources exhausted `image-071.pbm' @ error/cache.c/OpenPixelCache/4083.
convert-im6.q16: unable to read image data `image-071.pbm' @ error/pnm.c/ReadPNMImage/1346.
convert-im6.q16: no images defined `image-071.png' @ error/convert.c/ConvertImageCommand/3258
I can’t figure out what is causing this. On the face of it, it looks like it’s running out of memory space to handle the image. But even using the special arguments to increase memory space still seems to fail:
convert -limit memory 8000mb -limit disk 2gb -trim image-030.pbm image-030.png
convert-im6.q16: cache resources exhausted `image-030.pbm' @ error/cache.c/OpenPixelCache/4083.
convert-im6.q16: no images defined `image-030.png' @ error/convert.c/ConvertImageCommand/3258.
One workaround I’ve found is to use the Irfanview command line interface. I’m not a fan of mixing GUI and command line: Irfanview’s CLI has poor support for batch processing, confuses GUI configurations with CLI execution, and only works on Windows. But this does work:
i_view32.exe $file /resize_long=6000 /aspectratio /invert /convert=${file%%.*}.png
It seems that for images larger than 10K pixels in any direction, imagemagick will fail, even if there are still plenty of resources on the system. This command uses Irfanview to resize the image so that its long side is 6000 pixels, and inverts it while converting to PNG.
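If the failures really are tied to image dimensions, the size can be checked before choosing a tool. The header of a raw PBM or PPM file is plain text, so it can be read without decoding the image. This is a rough sketch under some assumptions: the header has the magic number on the first line and "width height" on the second with no comment lines, the 10000 pixel limit is only the threshold I observed, and `pnm_size` and `needs_downscale` are hypothetical helper names.

```shell
#!/bin/bash
# Print "WIDTH HEIGHT" from a raw PBM/PPM header.
# Assumption: line 1 is the magic number, line 2 is "<width> <height>",
# with no '#' comment lines in between.
pnm_size() {
  head -n 2 "$1" | tail -n 1
}

# Succeed if either side of the image is larger than the given limit.
# $1 = file, $2 = maximum pixels along one side
needs_downscale() {
  dims=$(pnm_size "$1")
  width=${dims% *}     # text before the space
  height=${dims#* }    # text after the space
  [ "$width" -gt "$2" ] || [ "$height" -gt "$2" ]
}

# Demo with header-only files; only the header is ever read.
tmp=$(mktemp -d)
printf 'P4\n12000 9000\n' > "$tmp/big.pbm"
printf 'P4\n2000 3000\n' > "$tmp/small.pbm"
for f in "$tmp"/*.pbm; do
  if needs_downscale "$f" 10000; then
    echo "${f##*/}: over the limit, resize first"
  else
    echo "${f##*/}: within the limit"
  fi
done
rm -r "$tmp"
```

This would let only the oversized images go through the resize workaround, leaving the rest at full resolution.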
Credit where credit is due
Each of the comics I have purchased has one or more copyright pages at the start of the PDF. These are not images, but text. I feel it only right to include them, so I render each whole page as a PPM, for inclusion in the later conversion. I use the pdftoppm application:
pdftoppm -f 2 -l 3 comicfile.pdf copyright
This exports PPM files named copyright-002.ppm and copyright-003.ppm, which I can later convert alongside the other PPM images. As C comes before I, CBZ readers should display these first.
Converting everything
Once I had chosen my settings, I prepared a script I could use to go through and bulk convert all the images I wanted. And once converted, I zipped them up. Of course, each comic book will have different orderings of copyright pages, may have multiple covers, etc., so I wanted to try and make it as general purpose as possible.
I tried to only use open source utilities for Linux as much as possible, but due to the bug listed above, I had no choice but to include Irfanview in the script, which makes it a bit messy to port for use on Linux systems. Had it not been for that bug, many of the plugins Irfanview uses are also available as CLI utilities, like optipng. Testing it out, it seems that whatever compression imagemagick is using is pretty good, so I decided not to re-optimise all the images it generated. I was running this as a bash script on Windows:
#!/bin/bash
# arguments: PDF file, first copyright page, last copyright page
filename="$1"
copy_start="$2"
copy_last="$3"
# get filename without .pdf
no_extension="${filename%%.*}"
# make directory for processing images
mkdir "$no_extension"
# extract the images and copyright pages
echo "Extracting images to $no_extension"
pdfimages "$filename" "$no_extension/image"
pdftoppm -f "$copy_start" -l "$copy_last" "$filename" "$no_extension/copyright"
# convert all of the ppm and pbm images to png
cd "$no_extension" || exit 1
echo "Converting and removing unwanted images"
for file in *; do
    currentWidth=$(identify -format "%[fx:w]" "$file")
    # skip the unwanted small decoration images (100 pixels wide or less)
    if [ "$currentWidth" -gt 100 ]; then
        case "$file" in
            *ppm*)
                echo "Converting $file to PNG"
                convert -trim "$file" "${file%%.*}.png"
                ;;
            *pbm*)
                echo "Converting $file to PNG, inverting"
                # need to use irfanview as imagemagick is crashing
                i_view32.exe "$file" /resize_long=6000 /aspectratio /invert "/convert=${file%%.*}.png"
                # still need to use imagemagick as irfanview has no CLI for trim
                mogrify -trim "${file%%.*}.png"
                ;;
        esac
    fi
    # optipng -v -o2 "${file%%.*}.png"
    # the original ppm/pbm is removed whether or not it was converted
    rm "$file"
done
# zip into a cbz file
echo "Zipping to $no_extension.cbz"
zip "$no_extension.cbz" *
mv "$no_extension.cbz" ..
cd ..
# delete the temp directory
rm -r "$no_extension"
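One detail of the script worth illustrating is how it strips the extension. `${filename%%.*}` removes everything from the first dot onwards, which is fine for a name like comicfile.pdf but would cut short a name containing extra dots. `${filename%.*}` removes only the last extension. The sketch below compares the two; the function names and file names are made up for illustration.

```shell
#!/bin/bash
# Two ways of stripping an extension with shell parameter expansion.
# strip_all and strip_last are hypothetical helper names.
strip_all()  { printf '%s\n' "${1%%.*}"; }   # cut at the first dot
strip_last() { printf '%s\n' "${1%.*}"; }    # cut at the last dot

# Hypothetical file names, for illustration only.
for name in comicfile.pdf saga.vol.2.pdf; do
  echo "$name -> $(strip_all "$name") / $(strip_last "$name")"
done
```

For PDFs whose names contain dots, the `%.*` form is probably the safer choice for the output directory and CBZ name.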
Anyway, that’s my latest journey into discovering what the tools on my system can do for me, what frustrations I fell prey to, and how I eventually wrote a script to make it easier for me to read my comics.
If you have any insights into how PDF files actually store their images, if you spot me using tools in inefficient ways, or anything else to add about comics, please leave a comment.
Hi,
Why don’t you use the -j option of pdfimages?
If it detects a JPEG image, it writes a JPEG file rather than PPM.
For PNG it is not a problem that it outputs PPM, since we can compress that losslessly.
Thanks for your script
Yes, the -j option is useful. It looks like most of the images I was seeing were PNG, so I felt it was not so useful.