Creating PDFs From Page Images
Suppose you have a book published in 1922 or earlier that you want to donate to the Internet Archive. They require that submissions be in PDF format. For now, assume that you have created images in JPG format for all the pages and that they are named sequentially, so that they sort in page order (for example page001.jpg, page002.jpg). You can make a PDF out of them using ImageMagick.
This is the command to create a PDF from a set of sequentially named images:
convert -verbose *.jpg my_e-bookname.pdf
This will take all the JPEG files in the current working directory and put them into a PDF. If you have a very short book, like a children's book, this is all you need. If you try to run this on a book with hundreds of pages it is likely to fail with an out of memory error (or, on Linux, a segmentation fault), because convert tries to hold every page in memory at once. The way around that is to make a PDF out of each page image, then join those PDFs together. We use a different ImageMagick command to make the per-page PDFs:
mogrify -verbose -format pdf *.jpg
To join the PDFs together we need another piece of software, called pdftk. You will need to download and install it separately.
The command you use to join the PDFs is this:
pdftk *.pdf cat output BookTitle.pdf
When you run this you may see many warning messages about the possibility of memory leaks. These messages should be safe to ignore.
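The two steps described above can be wrapped in a small shell function. This is only a sketch: it assumes mogrify and pdftk are both installed and that the page images sort correctly by name. The function name build_book_pdf and the pages_pdf working directory are made up for illustration.

```shell
# Sketch: build one PDF from many page images, one page at a time.
# build_book_pdf and the pages_pdf directory are hypothetical names.
build_book_pdf() {
    output=${1:-BookTitle.pdf}
    workdir=pages_pdf

    mkdir -p "$workdir" || return 1

    # Step 1: write one single-page PDF per JPEG into the working directory.
    # -path keeps the per-page PDFs out of the image directory.
    mogrify -verbose -path "$workdir" -format pdf *.jpg || return 1

    # Step 2: join the per-page PDFs, in filename order, into one document.
    pdftk "$workdir"/*.pdf cat output "$output"
}

# Usage (run in the directory that holds the page images):
#   build_book_pdf MyBook.pdf
```

Keeping the per-page PDFs in their own directory is a design choice rather than a requirement. It means that running the join a second time cannot pick up the finished book as one of its own input files, which can happen when *.pdf is expanded in the directory that already contains the output.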
Here is a PDF I made this way, viewed in Acrobat Reader:
Making Your PDFs Smaller
If you created a PDF from page images you may be a bit dismayed at how large the file is. One hundred and fifty megabytes for a three hundred page book is not uncommon.
If you look at the files available for each book at the Internet Archive, you'll see entries like this:
worksofjulesvern02vern.djvu 9686664
worksofjulesvern02vern.pdf 21892098
worksofjulesvern02vern_bw.pdf 17715851
worksofjulesvern02vern_jp2.zip 170943817
worksofjulesvern02vern_orig_jp2.tar 253030400
We can interpret this as follows:
- The uncropped images from the book scanner take up about 253 megabytes.
- The cropped images take up about 170 megabytes.
- The finished PDF takes up about 21 megabytes.
- A black and white version of the same PDF is a little under 18 megabytes.
- The DjVu version is smallest of all, at about 9.7 megabytes.
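The megabyte figures in this list come from dividing the byte counts in the file listing by one million. A minimal sketch of that arithmetic, assuming awk is available; the helper name bytes_to_mb is made up for illustration:

```shell
# Convert a byte count to decimal megabytes (1 MB = 1,000,000 bytes),
# rounded to one decimal place. bytes_to_mb is a hypothetical helper name.
bytes_to_mb() {
    awk -v bytes="$1" 'BEGIN { printf "%.1f\n", bytes / 1000000 }'
}

# Sizes copied from the file listing, largest first.
for size in 253030400 170943817 21892098 17715851 9686664; do
    bytes_to_mb "$size"
done
```

This prints 253.0, 170.9, 21.9, 17.7 and 9.7, one per line, so the finished PDF is less than a tenth of the size of the uncropped scans.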
How is this possible? I couldn't figure it out myself so I sent an email to the authors of the software the Internet Archive uses and it got forwarded on to the person who developed the PDF creating software. He was kind enough to explain the whole process, which I will paraphrase and simplify here.
The main secret of the process is that it divides each page image into three separate images which are combined to create the page you see in the PDF. These images are:
- The text in the book, stored as a black and white image at high quality. Since there are only two colors used even a high quality image takes up little space.
- The image layer, which is "downsampled" to a lower resolution than the original photograph to save space. On a computer screen the difference between the original image and the downsampled image is not noticeable.
- The page background, which is the bulk of the page area, is stored very highly compressed. The effect of this is to make the page background a more uniform color than the original book had, but that is not a problem.
If you read a PDF like the book Abroad, which has highly decorated pages, you can actually see the three layers coming into view separately.
This process is more complex than anything the home e-book maker would attempt. That does not mean that we cannot make our e-books dramatically smaller without losing an objectionable amount of quality, but we'll have to use simpler techniques. The key is to make the original page images smaller and more highly compressed. Once you do that you can make a PDF much smaller than the ones we can create with the original images.
If you're preparing the e-book for donation to the Internet Archive they're going to want the full sized PDF. They will of course prepare a new PDF which is smaller and has OCR'd text behind each page.
If the book is not going to the Internet Archive, you'll need to shrink the page images yourself.
Optimizing Page Sizes
One thing you can and should do when creating e-books from images is to first resize the pages so they are no larger than your screen can display. On an XO laptop the screen width is 1200 pixels. The page images I created with a Kodak 5 megapixel camera are a little over 1200 pixels wide once the images are rotated and trimmed. The difference is probably not worth bothering with. Pictures taken with an 8 megapixel camera are a different story.
The width of the screen is the important factor when choosing what size your images should be, since pages scroll vertically. Load one of your images into The GIMP or Picasa to see how wide it is in pixels. Decide what width in pixels you want your images to be (1200 for the XO laptop), then run the mogrify command from ImageMagick on them like this:
mogrify -resize 1200 -format jpg -quality 80% -verbose *.jpg
Note that mogrify will update your images in place, so you definitely want to back up the originals to CD first as well as copy them to a new directory. You may want to experiment with the -quality setting. The JPEG format does what is known as "lossy" compression. This means it gets a smaller file size by removing detail from the picture.
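One way to reduce the risk of overwriting your only copy is to have mogrify write its results into a separate directory using its -path option. The following is a sketch, not a replacement for a proper backup. The function name resize_pages and the default directory name resized are made up for illustration.

```shell
# Sketch: write resized copies into a separate directory so the files in
# the current directory are not modified. resize_pages and the "resized"
# default are hypothetical names.
resize_pages() {
    width=${1:-1200}
    quality=${2:-80}
    outdir=${3:-resized}

    mkdir -p "$outdir" || return 1

    # -resize with a single number sets the width; the height follows
    # the aspect ratio of each page.
    mogrify -verbose -path "$outdir" -resize "$width" \
        -format jpg -quality "$quality" *.jpg
}

# Usage (run in the directory that holds the page images):
#   resize_pages 1200 80 resized
```

The resized pages end up in the output directory under their original names, so it is easy to try a different width or quality setting by running the function again with another directory name.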
This might be hard to imagine, so suppose you have a photograph. JPEGs can display 16.7 million colors, but the human eye can't always distinguish them. If there is a blue sky in the photograph, the sky won't be all the same color. Say there are 1,000 shades of blue in the sky. If you averaged out the colors so that only 256 shades were used you might not be able to tell the difference, but the amount of information in the picture would go down noticeably, resulting in a much smaller file. This is a simplified picture of what JPEG compression really does, but it gives the right idea: detail you are unlikely to miss is thrown away to save space.
80% quality will generally give good results, but you should experiment. You might experiment with image sizes too. Comic book zips rarely contain images wider than 900 pixels, yet they look good enlarged.
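To make that kind of experiment easier, you can save a single page at several quality levels and compare the results side by side. This is a sketch that assumes the convert command is available. The function name quality_trials, the quality_test directory, and the particular quality levels are arbitrary choices for illustration.

```shell
# Sketch: save one page at several JPEG quality levels for comparison.
# quality_trials and the quality_test directory are hypothetical names.
quality_trials() {
    page=$1
    outdir=quality_test

    mkdir -p "$outdir" || return 1

    for q in 60 70 80 90; do
        # The original page is only read; each trial is written to a new file.
        convert "$page" -resize 1200 -quality "$q" "$outdir/q$q.jpg" || return 1
    done

    # List the trial files together with their sizes in bytes.
    ls -l "$outdir"
}

# Usage:
#   quality_trials page001.jpg
```

Open the trial files in an image viewer and pick the lowest quality level at which the text is still comfortable to read.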
Space savings can be significant. The original files for this book took up 173.9 megabytes. The resized files take up 69.3 megabytes, a reduction of about 60 percent. That's not as good as the Internet Archive does, but it's a decent improvement. You can experiment with different quality levels to see how much you can compress your JPEGs without hurting quality. You might use a lower quality for text pages and a higher one for color illustrations, or vice versa. The resized PDF looks as good as the original.
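To put a number on the saving for your own book, compare the total size of the page images before and after resizing. A minimal sketch of the arithmetic, using the two sizes quoted above; the helper name percent_saved is made up for illustration:

```shell
# Print the percentage saved when a set of files shrinks from one size
# to another. Both sizes must be in the same unit.
# percent_saved is a hypothetical helper name.
percent_saved() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Sizes in megabytes quoted in the text.
percent_saved 173.9 69.3
```

This prints 60.1, meaning the resized images take up roughly 40 percent of the space of the originals. On most systems a command such as du -sh run on each directory will report the sizes to feed in.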
If you want to make your book still smaller you can make a DjVu document out of the resized images.
Of course, if you are really serious about making smaller PDFs you'll want to do OCR on the scanned pages to get plain text, then use your word processor to make a PDF out of that text. Doing that will be covered in the chapter on Plain Text files.
Correcting Page Sizes
It is very likely that your cropped page images will not all be the same size. Quite often this is not a problem, but sometimes your PDFs will look like this when you try to read them:
Some of the books scanned by Microsoft and Google and uploaded to the Internet Archive have this problem. You can fix it by making all your page images the same width and re-creating the PDF. The mogrify command used to resize images earlier in this chapter can be used for this, with some simple modifications. Change the -resize parameter to a width slightly smaller than your page images: if most of your pages are 1295 pixels wide or so, make the width 1200. You can leave off the -quality parameter, in which case ImageMagick tries to reuse the quality level of the original image. This will make all your pages the same width, and the PDF you make from those images should not have the problem shown above.
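Rather than opening pages one at a time to check their widths, you can ask ImageMagick's identify command to report the width of every page and then look at the range. This is a sketch only. The function names page_widths and width_range are made up for illustration, and the target width is still your own choice.

```shell
# Print the width in pixels of every page image, one per line.
# identify is part of ImageMagick; %w stands for the image width.
page_widths() {
    identify -format '%w\n' *.jpg
}

# Read one width per line on standard input and print the narrowest
# and widest values. Both function names here are hypothetical.
width_range() {
    sort -n | awk 'NR == 1 { min = $1 } { max = $1 } END { if (NR) print min, max }'
}

# Usage (run in a copy of the page image directory):
#   page_widths | width_range
#   mogrify -resize 1200 -format jpg -verbose *.jpg
```

If the two numbers are close together, a round width slightly below the smaller one is a reasonable target. If one page is much narrower than the rest it is probably worth checking that page by hand before resizing everything to match it.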