Posts tagged with "java"



This week I got to research the best way to take a multipage TIFF file and convert it to PDF. When I first started on this, I went immediately to iText, as that was the only library I was familiar with. After getting a working example going, I checked the license for the most current version of iText and realized it cannot be used in closed-source applications without buying a license. That is what led me to start searching for other Java PDF libraries.

I came across Apache PDFBox and saw that it could add images to a PDF, and even had a class for adding a TIFF. I gave the TIFF example a try and got a complaint about unsupported compression. After pondering this for a while, I had the thought of reading in each page of the TIFF as a BufferedImage and then placing each one into the PDF as a JPG. This approach requires two libraries:

  1. commons-imaging
  2. pdfbox

Commons Imaging can be substituted with something else if you can get a List<BufferedImage> from the TIFF. It has not been released yet and lives in the Apache sandbox as a snapshot. Here is the Maven dependency for it, along with the snapshot repository:

<repository>
  <id>apache.snapshots</id>
  <name>Apache Development Snapshot Repository</name>
  <url>https://repository.apache.org/content/repositories/snapshots/</url>
  <releases>
    <enabled>false</enabled>
  </releases>
  <snapshots>
    <enabled>true</enabled>
  </snapshots>
</repository>

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-imaging</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>
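If you would rather not pull in the snapshot, one possible substitute is plain ImageIO: Java 9 and later ship a TIFF plugin (older JVMs need something like JAI ImageIO or TwelveMonkeys registered). A sketch of what that could look like, with the class and method names being mine:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class TiffPages {

    // Reads every page of a multipage TIFF into BufferedImages using
    // whatever ImageIO reader is registered for the file.
    public static List<BufferedImage> readAllPages(File tiff) throws IOException {
        List<BufferedImage> pages = new ArrayList<BufferedImage>();
        ImageInputStream in = ImageIO.createImageInputStream(tiff);
        try {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader found for " + tiff);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                // true = let the reader scan the whole file to count pages
                int pageCount = reader.getNumImages(true);
                for (int i = 0; i < pageCount; i++) {
                    pages.add(reader.read(i));
                }
            } finally {
                reader.dispose();
            }
        } finally {
            in.close();
        }
        return pages;
    }
}
```

Whether this dodges the unsupported-compression problem depends entirely on which TIFF plugin is registered, so treat it as an option to test against your files rather than a drop-in fix.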

Here is the Maven dependency for PDFBox:
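Presumably something like the following; the version here is my assumption, since the PDJpeg and PDXObjectImage classes used below come from the PDFBox 1.x API:

```xml
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>1.8.4</version>
</dependency>
```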


Here is an example class that can be run against a directory of TIFF images. All TIFF files in the directory will be converted to PDF.

package com.iws.export;

import java.awt.Dimension;
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;

import org.apache.commons.imaging.Imaging;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDJpeg;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public class QueueToPdf {

    public static void main(String[] args) {
        try {
            new QueueToPdf().generatePdfFromTifPbox(new File("/mydir"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void generatePdfFromTifPbox(File dir) {
        for (final File f : dir.listFiles()) {
            try (PDDocument doc = new PDDocument()) {
                // read every page of the tiff into memory
                List<BufferedImage> bimages = Imaging.getAllBufferedImages(f);
                for (BufferedImage bi : bimages) {
                    PDPage page = new PDPage();
                    doc.addPage(page);
                    PDPageContentStream contentStream = new PDPageContentStream(doc, page);
                    try {
                        // the .08F can be tweaked. Go up for better quality,
                        // but the size of the PDF will increase
                        PDXObjectImage image = new PDJpeg(doc, bi, .08F);

                        // scale the image so it fits on the page
                        Dimension scaledDim = getScaledDimension(
                                new Dimension(image.getWidth(), image.getHeight()),
                                page.getMediaBox().createDimension());
                        contentStream.drawXObject(image, 1, 1, scaledDim.width, scaledDim.height);
                    } finally {
                        contentStream.close();
                    }
                }
                doc.save(f.getAbsolutePath() + ".pdf");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    // taken from a Stack Overflow post:
    // http://stackoverflow.com/questions/23223716/scaled-image-blurry-in-pdfbox
    // Thanks Gyo!
    private Dimension getScaledDimension(Dimension imgSize, Dimension boundary) {
        int original_width = imgSize.width;
        int original_height = imgSize.height;
        int bound_width = boundary.width;
        int bound_height = boundary.height;
        int new_width = original_width;
        int new_height = original_height;

        // first check if we need to scale width
        if (original_width > bound_width) {
            // scale width to fit
            new_width = bound_width;
            // scale height to maintain aspect ratio
            new_height = (new_width * original_height) / original_width;
        }

        // then check if we need to scale even with the new height
        if (new_height > bound_height) {
            // scale height to fit instead
            new_height = bound_height;
            // scale width to maintain aspect ratio
            new_width = (new_height * original_width) / original_height;
        }

        return new Dimension(new_width, new_height);
    }
}
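As a quick sanity check on that scaling logic, here is a standalone copy of the same fit computation (the class and method names are mine), run against a 300 DPI letter-size scan and PDFBox's default 612x792-point page:

```java
import java.awt.Dimension;

public class ScaleCheck {

    // Same aspect-ratio-preserving fit as getScaledDimension above.
    static Dimension fit(Dimension img, Dimension bound) {
        int w = img.width;
        int h = img.height;
        if (w > bound.width) {
            // scale width to fit, height to keep the aspect ratio
            h = (bound.width * img.height) / img.width;
            w = bound.width;
        }
        if (h > bound.height) {
            // still too tall: scale height to fit instead
            w = (bound.height * img.width) / img.height;
            h = bound.height;
        }
        return new Dimension(w, h);
    }

    public static void main(String[] args) {
        // A 300 DPI letter scan (2550x3300 px) lands exactly on a
        // 612x792-point page: 612 * 3300 / 2550 = 792
        System.out.println(fit(new Dimension(2550, 3300), new Dimension(612, 792)));
    }
}
```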


The only complaint I have about this solution is that it roughly doubles the size of the TIFF at the current JPG quality level. I was impressed that when I first tried iText, the PDF actually ended up slightly smaller than the TIFF file I was converting. I am not sure if there is some other way the JPEG could be compressed, or if the TIFF compression is simply better than JPEG. Please feel free to share your thoughts or other solutions.


Wow, it has been a month since my last post. I thought I would come on and give everyone an update on what I have been working on lately.

For the past week and a half I have had the pleasure of trying to get a Java cache implementation working. I was caching two very large tables locally from one of the applications at work that I pull data from. The system I am working with has a very simplistic SQL engine for accessing its files, but it does not support sub-queries. This usually leads to a lot of extra queries being run in a loop, putting extra load on the DB and slowing down getting at the data I want.

The light bulb finally went on when I had the idea to just cache these two tables, using the PK as the cache key and, for the value, a map of the column names and values from the DB for each row in the table. I had used EHCache in the past with success for some in-memory caching, so I thought I would just throw EHCache at this and be done. I basically wanted an eternal cache so I did not have to rebuild the entire cache from scratch every time I needed it. I got my cache all set up, built out, and caching to disk. The problem was that the cache was getting cleared out every time the JVM reloaded. I researched this a bit and found out I was missing an option on my cache setup to tell EHCache to reload from disk on startup. Ah ha, I thought, this is going to totally rock, and then this happened:
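The shape of that cache can be sketched in plain Java (the names here are mine, not from the actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

public class RowCache {

    // PK -> (column name -> column value), one entry per row in the table.
    private final Map<String, Map<String, Object>> rows =
            new HashMap<String, Map<String, Object>>();

    public void putRow(String pk, Map<String, Object> columns) {
        rows.put(pk, columns);
    }

    // Replaces what would otherwise be an extra query inside the loop:
    // look the row up by its primary key instead of going back to the DB.
    public Object getColumn(String pk, String column) {
        Map<String, Object> row = rows.get(pk);
        return row == null ? null : row.get(column);
    }
}
```

The whole point of the exercise in the post is making a structure like this survive a JVM restart, which is where the caching libraries come in.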

RuntimeException: This feature is only available in the enterprise version

Nooooooo, how could this be, lol! I checked out the pricing for the Enterprise version, and you had to call. We did not want to spend a lot of money on this project, so I knew EHCache was not going to work out for us at this point. I have nothing against paying for software, by the way, and I am sure EHCache is worth the money if you have it to throw at the problem.

Well, after lots of Googling and reading, I landed on Infinispan, which also looked really nice. I really liked that it had a slightly simpler way of moving data in and out of the cache than EHCache. With EHCache, everything has to be wrapped in an Element, like so:

Element e = new Element("myKey", objectToCache);
cache.put(e);

Element fromCache = cache.get("myKey");

It was much more Map-like with Infinispan:

cache.put("myKey", objectToCache);
Object objectFromCache = cache.get("myKey");


I did a simple test with Infinispan and was able to push a value to the cache, shut the cache down, and retrieve the value again when I fired the cache back up. Another wave of excitement swept over me as I moved my existing EHCache code over to the Infinispan API. I was able to use their SingleFileCacheStore and build out the cache. I then started using the cache and delighted at the speed at which it flew through the data, not having to do my extra DB queries. But then I watched as it went slower and slower: GC, GC, GC, out of memory. NOOOoooooo! The objects that were getting deserialized from the cache never seemed to get garbage collected. I spent a while researching this and ran into a bunch of dead ends.

Back to The Google I go….

I briefly tried JCS, but could not get the cache to reload from disk on startup. I thought maybe it was something stupid in my configuration, but I went through the docs multiple times and posted a question on Stack Overflow with no success.

Yup you guessed it, back to Google AGAIN!

I finally landed on this gem: MapDB. It is actually built to be a database for storing Java objects, and it implements the Java collections interfaces so you can work with it as a Map, List, etc. As of right now, I could not be happier. My Infinispan cache was a 2GB file; MapDB's file is 500MB caching the same data. It is a little slower pushing the data in. There is an async option for writes that speeds it up quite a bit, but I was getting a ConcurrentModificationException when I had it turned on. I just shut it off for now, as I was not really concerned with the speed at which my cache got created. I did not notice any significant speed difference when pulling the objects out of the DB. The memory usage stays very low, and I am quite impressed with how fast it can go to the file and pull the information out without having it in memory. Here is the JavaDoc for anyone interested.
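MapDB handles all of this itself, but the effect I was after, a Map whose contents survive a JVM restart, can be illustrated with a minimal hand-rolled stand-in using plain serialization (this is not MapDB's actual mechanism, just a sketch of the idea):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class DiskMap {

    // Write the whole map to disk so the next JVM run can load it back.
    public static void save(File f, HashMap<String, Object> map) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f));
        try {
            out.writeObject(map);
        } finally {
            out.close();
        }
    }

    @SuppressWarnings("unchecked")
    public static HashMap<String, Object> load(File f)
            throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(f));
        try {
            return (HashMap<String, Object>) in.readObject();
        } finally {
            in.close();
        }
    }
}
```

Unlike this sketch, which drags the entire map into memory, MapDB pulls entries off the file on demand, which is exactly why its memory usage stays so low.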

I would be very interested to know your thoughts / experiences with similar implementations.