My Question

I'm looking for a way to convert the individual PDF pages into byte arrays (one byte[] per PDF page) so that I can then convert each of them to a BufferedImage.

This way, all of the conversion is done in memory instead of through temporary files, which should be faster and less messy. I may use the byte arrays for service calls later on as well. It would be nice if I could keep the library use to only iText; however, if there isn't any other way, I'm open to other libraries.
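For reference, here is a minimal sketch of the byte[] to BufferedImage step on its own, using only the JDK. One thing worth knowing: ImageIO.read() only decodes encoded image files (PNG, JPEG, and so on) and returns null for any other bytes, so it cannot turn a raw PDF page content stream into an image. The class and method names below (InMemoryImageBytes, toBytes, fromBytes) are made up for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper: round-trips images through byte[] without temporary files.
final class InMemoryImageBytes {

    static {
        // By default ImageIO may spool streams to a disk cache; turn that off
        // so the conversion really stays in memory.
        ImageIO.setUseCache(false);
    }

    /** Encodes an image into a byte[] in the given format, e.g. "png". */
    static byte[] toBytes(BufferedImage image, String format) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(image, format, out)) {
            throw new IOException("No ImageIO writer found for format: " + format);
        }
        return out.toByteArray();
    }

    /** Decodes an encoded image (PNG, JPEG, ...) from a byte[]. */
    static BufferedImage fromBytes(byte[] encoded) throws IOException {
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(encoded));
        if (image == null) {
            // ImageIO.read returns null when no registered reader recognizes the bytes.
            throw new IOException("Bytes are not in a recognized image format");
        }
        return image;
    }
}
```

The explicit null check matters here: without it, bytes that are not an encoded image silently produce a null BufferedImage instead of an error.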

What I have now

This is the code that I currently have:

    import java.awt.image.BufferedImage;
    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import javax.imageio.ImageIO;
    import com.itextpdf.text.pdf.PdfReader;

    public static BufferedImage toBufferedImage(byte[] input) throws IOException {
        InputStream in = new ByteArrayInputStream(input);
        BufferedImage bimg = ImageIO.read(in);
        return bimg;
    }

    public static BufferedImage[] extract(final String fileName) throws IOException {
        PdfReader reader = new PdfReader(fileName);
        int pageNum = reader.getNumberOfPages();
        BufferedImage[] imgArray = new BufferedImage[pageNum];

        // iText page numbers are 1-based, so loop from 1 to pageNum
        for (int page = 1; page <= pageNum; page++) {
            //TODO: You may need to decode the bytearray first?
            imgArray[page - 1] = toBufferedImage(reader.getPageContent(page));
        }
        reader.close();
        return imgArray;
    }

    public static void convert() throws IOException {
        String fileName = getProps("file_in");
        BufferedImage[] bim = extract(fileName);
        // close streams; Closed implicitly by try-with-resources
    }

And here's a (non-representative) list of the links that I've checked out so far.

Useful, but not quite what I want

Uses a different library

  • First, by "converting a PDF to an image" do you really mean "extract existing images from a PDF"? Looking at your code it appears to be about extracting and not converting. Commented Jun 17, 2016 at 16:06
  • @ChrisHaas Well the goal is to convert it. Right now, the thing that itext is doing (as far as I can tell) is making each page in the pdf a separate jpg file. Then each jpg file is merged into a multipage tiff. I want to stop making local temporary files, and do this pdf -> byte[] -> BufferedImage -> MultipageTiff all in memory. Commented Jun 17, 2016 at 17:11
  • Just like @Chris said, your question as a whole is not clear, it's partially about rendering pages as images and partially about extracting bitmap images from the pages. Itext does not (yet) include an image rendering API but it does have a bitmap extraction API. Commented Jun 17, 2016 at 17:15
  • pdf -> byte[] -> BufferedImage -> MultipageTiff - what do you expect that byte[] to contain? Commented Jun 17, 2016 at 17:16
  • I've slimmed down the question to make it more clear hopefully. @mkl So if I understand what you're saying in your first comment, itext doesn't make images, it extracts them? That's what I've For your second comment, I may have been a little unclear. I want to try to extract each page in the pdf to a separate byte[], not the whole pdf to a single byte[]. Commented Jun 17, 2016 at 17:46

1 Answer

I did some digging and came up with a solution! Hopefully someone else finds this when they need it, and that it helps as much as possible. Cheers!

Extending the RenderListener Class

I looked around and found this. Looking through the code and classes, I found that PdfImageObjects have a getBufferedImage() which is exactly what I was looking for. Now there's no need to convert to a byte[], which is what I originally thought I was going to have to do. Using the given example code, I came up with this class:

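    import java.awt.image.BufferedImage;
    import java.io.IOException;
    import java.util.ArrayList;
    import com.itextpdf.text.pdf.parser.ImageRenderInfo;
    import com.itextpdf.text.pdf.parser.PdfImageObject;
    import com.itextpdf.text.pdf.parser.RenderListener;
    import com.itextpdf.text.pdf.parser.TextRenderInfo;

    public class MyImageRenderListener implements RenderListener {
        protected String path = "";
        protected ArrayList<BufferedImage> bimg = new ArrayList<>();

        /**
         * Creates a RenderListener that will look for images.
         */
        public MyImageRenderListener(String path) {
            this.path = path;
        }

        public ArrayList<BufferedImage> getBimgArray() {
            return bimg;
        }

        /**
         * @see com.itextpdf.text.pdf.parser.RenderListener#renderImage(
         *     com.itextpdf.text.pdf.parser.ImageRenderInfo)
         */
        public void renderImage(ImageRenderInfo renderInfo) {
            try {
                PdfImageObject image = renderInfo.getImage();
                if (image == null) {
                    return;
                }
                bimg.add(image.getBufferedImage());
            } catch (IOException e) {
                System.out.println(e.getMessage());
            }
        }

        // The remaining RenderListener callbacks are not needed for image extraction.
        public void beginTextBlock() { }

        public void endTextBlock() { }

        public void renderText(TextRenderInfo renderInfo) { }
    }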

Important changes to notice here compared to the link above are the additions of a new field ArrayList<BufferedImage> bimg, a getter for that field, and a restructuring of the renderImage() function.

I also changed some of the methods in the other class of my project:

Code for Bursting PDF to BufferedImage[]

    // Credit to Mihai. Code found here:
    // http://stackoverflow.com/questions/6851385/save-tiff-ccittfaxdecode-from-pdf-page-using-itext-and-java
    public static ArrayList<BufferedImage> getBufImgArr(final String BasePath) throws IOException {
        PdfReader reader = new PdfReader(BasePath);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        MyImageRenderListener listener = new MyImageRenderListener(BasePath + "extract/image%s.%s");

        // The parser calls listener.renderImage() for every image on each page.
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            parser.processContent(page, listener);
        }
        reader.close();
        return listener.getBimgArray();
    }

Code for Converting BufferedImage[] to Multi-Page Tiff

    public static void convert(String fin) throws FileNotFoundException, IOException {
        ArrayList<BufferedImage> bimgArrL = getBufImgArr(fin);
        BufferedImage[] bim = new BufferedImage[bimgArrL.size()];
        bimgArrL.toArray(bim);

        try (RandomAccessOutputStream rout = new FileCacheRandomAccessOutputStream(
                new FileOutputStream("/path/you/want/result/to/go.tiff"))) {

            // The options for the tiff file are set here.
            // **THIS BLOCK USES THE ICAFE LIBRARY TO CONVERT TO MULTIPAGE-TIFF**
            // ICAFE: https://github.com/dragon66/icafe
            // Note: as written, the options configured on the builder are never
            // passed to writeMultipageTIFF() below.
            ImageParam.ImageParamBuilder builder = ImageParam.getBuilder();
            TIFFOptions tiffOptions = new TIFFOptions();
            tiffOptions.setApplyPredictor(true);
            tiffOptions.setTiffCompression(Compression.CCITTFAX4);
            tiffOptions.setDeflateCompressionLevel(0);
            builder.imageOptions(tiffOptions);

            TIFFTweaker.writeMultipageTIFF(rout, bim);
            // I found this block of code here: https://github.com/dragon66/icafe/wiki
            // About 3/4 of the way down the page
        }
    }
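If the multi-page TIFF should also stay in memory (for example, to send it in a service call), one possible alternative to the iCafe block is the TIFF writer that ships with javax.imageio. This is only a sketch under the assumption that you are on Java 9 or later, where that writer is built in; it is not the approach used in the answer above, and the class name InMemoryMultipageTiff is made up. No compression is configured here.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

// Hypothetical helper: writes all pages into one TIFF held in a byte[].
final class InMemoryMultipageTiff {

    static byte[] write(List<BufferedImage> pages) throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
        if (!writers.hasNext()) {
            // Assumption: a TIFF writer is only built in from Java 9 onwards.
            throw new IOException("No TIFF ImageWriter available");
        }
        ImageWriter writer = writers.next();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        // MemoryCacheImageOutputStream keeps the seekable data in memory
        // and flushes it to the buffer when closed.
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(buffer)) {
            writer.setOutput(out);
            writer.prepareWriteSequence(null);
            for (BufferedImage page : pages) {
                // Each call appends one page to the TIFF.
                writer.writeToSequence(new IIOImage(page, null, null), null);
            }
            writer.endWriteSequence();
        } finally {
            writer.dispose();
        }
        return buffer.toByteArray();
    }
}
```

With the code from this answer, the input list would be the result of getBufImgArr(fin), and the returned byte[] could be written to a file or passed along directly.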

To kick off the whole process:

    public static void main(String[] args) throws IOException {
        convert("/path/to/pdf/image.pdf");
    }

IMPORTANT TO NOTE:

You may notice that listener.renderImage() is never explicitly called in my code. That's because renderImage() is a callback: when the listener is passed to parser.processContent(page, listener) in the getBufImgArr(param) method, the parser walks the page's content and calls renderImage() for every image it encounters.
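To make that callback relationship concrete, here is a toy version of the same pattern in plain Java with no iText involved. Everything in it (ImageListener, ToyParser, CollectingListener) is a made-up stand-in and not part of any library; it only illustrates who calls whom.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a listener interface such as RenderListener. */
interface ImageListener {
    void renderImage(String imageName);
}

/** Hypothetical stand-in for the parser: it owns the loop and calls the listener. */
final class ToyParser {
    private final List<List<String>> pages;

    ToyParser(List<List<String>> pages) {
        this.pages = pages;
    }

    /** Page numbers are 1-based here, mirroring the loop in getBufImgArr(). */
    void processContent(int pageNumber, ImageListener listener) {
        for (String imageName : pages.get(pageNumber - 1)) {
            listener.renderImage(imageName); // the "hidden" call
        }
    }
}

/** Collects whatever the parser reports, like the bimg list in the answer. */
final class CollectingListener implements ImageListener {
    private final List<String> seen = new ArrayList<>();

    @Override
    public void renderImage(String imageName) {
        seen.add(imageName);
    }

    List<String> getSeen() {
        return seen;
    }
}
```

The calling code never invokes renderImage() itself; it only hands the listener to processContent() and reads the collected results afterwards.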

As @mkl has noted in the comments below, this code extracts every bitmap image embedded in each PDF page; it does not render the page, since a PDF page isn't an image in and of itself. Problems may occur if you run this code on PDFs that were scanned and OCR'd, or on PDFs that have multiple layers. In that scenario, a single PDF page can yield several images, each of which becomes its own TIFF page, when you (may) want them to stay together on a single page.
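One way to at least detect that situation is to remember which page each image came from instead of putting everything into one flat list. The following is a rough sketch in plain Java; PageImageGrouper and its methods are hypothetical names, and it assumes the caller announces the page number before that page is processed.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups extracted images by the page they came from.
final class PageImageGrouper {
    private final Map<Integer, List<BufferedImage>> imagesByPage = new LinkedHashMap<>();
    private int currentPage = 0; // 0 means no page has been started yet

    /** Call this before processing each page. */
    void startPage(int pageNumber) {
        currentPage = pageNumber;
        imagesByPage.computeIfAbsent(pageNumber, k -> new ArrayList<>());
    }

    /** Call this for every image found on the current page. */
    void addImage(BufferedImage image) {
        if (currentPage == 0) {
            throw new IllegalStateException("startPage() must be called first");
        }
        imagesByPage.get(currentPage).add(image);
    }

    List<BufferedImage> imagesOn(int pageNumber) {
        return imagesByPage.getOrDefault(pageNumber, Collections.emptyList());
    }

    /** Pages where extraction produced more than one image. */
    List<Integer> pagesWithMultipleImages() {
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, List<BufferedImage>> entry : imagesByPage.entrySet()) {
            if (entry.getValue().size() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

In the code above, the loop in getBufImgArr() would call startPage(page) before parser.processContent(page, listener), and renderImage() would call addImage() instead of bimg.add(). Actually merging several images back into one page would additionally need their positions, which this sketch does not track.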

Good sources I found:

Programcreek search for PdfReaderContentParser


2 Comments

In contrast to your question, your code (A) in general does not render the whole page but instead only extracts the bitmap images from the page (in the case of scanned PDFs these notions may coincide, though), and (B) there are no byte arrays visible at all here. If you had asked for a way to extract embedded bitmap images from a PDF from the beginning, you'd have had an answer very quickly.
It seems I didn't quite get what PDFs are or how they store data inside. It makes more sense now to think of these particular PDF files as images inside a container. As for (B), that seems to be a result of me not fully understanding the problem scope and what was actually needed. Either way, I appreciate you taking the time to explain and help me out @mkl
