How to make text invisible in an existing PDF

Question

I want to make all the text in an existing PDF transparent.

Option 1: select all the text, find a color property and change it to "colorless"

Or, if there is no such property

Option 2: Parse the page content stream and all Form XObjects for that page, detect text blocks (BT/ET), and set the render mode to invisible.

This seems to be a complex operation.
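For illustration only, the core of Option 2 can be sketched on an already-decoded content stream held in a string. This is a minimal sketch under several assumptions, not an iText feature: the names `InvisibleTextSketch` and `makeTextInvisible` are hypothetical, extracting and writing back the streams of the page and its Form XObjects is not shown, and inline images and `%` comments are not handled. It inserts `3 Tr` (render mode 3, neither fill nor stroke) after every `BT` and rewrites the operand of any existing `Tr` operator to 3, while leaving string literals and names untouched.

```java
class InvisibleTextSketch {

    private static boolean isDelimiterOrSpace(char c) {
        return Character.isWhitespace(c) || "()<>[]{}/%".indexOf(c) >= 0;
    }

    /** Returns the content stream with all text switched to render mode 3. */
    static String makeTextInvisible(String content) {
        StringBuilder out = new StringBuilder();
        int n = content.length();
        int i = 0;
        while (i < n) {
            char c = content.charAt(i);
            if (c == '(') {
                // Literal string: copy verbatim, honoring nesting and backslash escapes.
                int depth = 0;
                do {
                    char s = content.charAt(i);
                    if (s == '\\' && i + 1 < n) {
                        out.append(s).append(content.charAt(i + 1));
                        i += 2;
                        continue;
                    }
                    if (s == '(') depth++;
                    if (s == ')') depth--;
                    out.append(s);
                    i++;
                } while (depth > 0 && i < n);
            } else if (c == '/') {
                // Name object such as /F1: copy it whole so it is never read as an operator.
                out.append(c);
                i++;
                while (i < n && !isDelimiterOrSpace(content.charAt(i))) {
                    out.append(content.charAt(i));
                    i++;
                }
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < n && Character.isLetter(content.charAt(i))) i++;
                String op = content.substring(start, i);
                if (op.equals("Tr")) {
                    // Replace the integer operand that precedes an existing Tr with 3.
                    int end = out.length();
                    while (end > 0 && Character.isWhitespace(out.charAt(end - 1))) end--;
                    int begin = end;
                    while (begin > 0 && Character.isDigit(out.charAt(begin - 1))) begin--;
                    if (begin < end) out.replace(begin, end, "3");
                }
                out.append(op);
                if (op.equals("BT")) out.append(" 3 Tr");
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(makeTextInvisible("BT /F1 12 Tf 0 Tr (BT inside) Tj ET"));
    }
}
```

A rewrite like this only affects text drawn with PDF text operators; text that is part of a raster image is unaffected.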

Here is my example file

The following code generates the PDF (the example PDF file):

    Document document = new Document(new Rectangle(width, height));
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(filename));
    document.open();
    PdfContentByte picCanvas = null;
    PdfContentByte txtCanvas = null;
    if (isUnderPic) {
        txtCanvas = writer.getDirectContentUnder();
        picCanvas = writer.getDirectContent();
    } else {
        txtCanvas = writer.getDirectContent();
        picCanvas = writer.getDirectContentUnder();
    }
    BaseFont bf = null;
    if (null != pageList) {
        int[] dpi = { 0, 0 };
        if (dpiType == 1) {
            dpi[0] = 300;
            dpi[1] = 300;
        } else if (dpiType == 2) {
            dpi[0] = 600;
            dpi[1] = 600;
        }
        for (int i = 0; i < pageList.size(); i++) {
            PDFPage page = pageList.get(i);
            Image pageImage = null;
            if (pdfType == 3) {
                pageImage = Image.getInstance(page.getBinImage());
            } else {
                pageImage = Image.getInstance(page.getOriImage());
            }
            if (pageImage.getWidth() > 0) {
                pageImage.scaleAbsolute(page.getWidth(), page.getHeight());
            }
            pageImage.setAbsolutePosition(0, 0);
            picCanvas.addImage(pageImage);
            if (pdfType == 2 || pdfType == 3) {
                for (PageElement ele : page.getElementList()) {
                    if (ele.getType().equals(PDFConstant.ElementType.PDF_ELEMENT_CHAR)) {
                        txtCanvas.beginText();
                        if (isColor) {
                            txtCanvas.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_FILL);
                            txtCanvas.setColorFill(BaseColor.RED);
                        } else {
                            txtCanvas.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_INVISIBLE);
                        }
                        String font = ele.getFont();
                        try {
                            bf = fonts.get(font);
                            if (null == bf) {
                                bf = BaseFont.createFont(font, "UniGB-UCS2-H", BaseFont.NOT_EMBEDDED);
                                fonts.put(font, bf);
                            }
                        } catch (Exception e) {
                            bf = BaseFont.createFont("STSong-Light", "UniGB-UCS2-H", BaseFont.NOT_EMBEDDED);
                            fonts.put(font, bf);
                        }
                        txtCanvas.setFontAndSize(bf, ele.getFontSize());
                        txtCanvas.setTextMatrix(ele.getPageX(), ele.getPageY(page.getRcInPage()));
                        txtCanvas.showText(ele.getCode());
                        txtCanvas.endText();
                    }
                }
            }
            if (StringUtils.isNotBlank(cutPath)) {
                for (PageElement ele : page.getElementList()) {
                    if (ele.getType().equals(PDFConstant.ElementType.PDF_ELEMENT_PIC)
                            && StringUtils.isNotBlank(ele.getCutPicSrc())) {
                        ImageTools.cutPic(ele.getRcInImage(), page.getOriImage(), ele.getCutPicSrc(), dpi);
                    }
                }
            }
            if (pdfType == 3) {
                logger.debug("pdfType == 3");
                for (PageElement ele : page.getElementList()) {
                    if (ele.getType().equals(PDFConstant.ElementType.PDF_ELEMENT_PIC)
                            && StringUtils.isNotBlank(ele.getCutPicSrc())) {
                        if (new File(ele.getCutPicSrc()).exists()) {
                            Image cutCover = Image.getInstance(
                                    ImageTools.drawImage((int) ele.getWidth(), (int) ele.getHeight()));
                            if (cutCover.getWidth() > 0) {
                                cutCover.scaleAbsolute(ele.getWidth(), ele.getHeight());
                            }
                            cutCover.setAbsolutePosition(ele.getPageX(), ele.getPageY(page.getRcInPage()));
                            picCanvas.addImage(cutCover);
                            Image pic = Image.getInstance(ele.getCutPicSrc());
                            if (pic.getWidth() > 0) {
                                pic.scaleAbsolute(ele.getWidth(), ele.getHeight());
                            }
                            pic.setAbsolutePosition(ele.getPageX(), ele.getPageY(page.getRcInPage()));
                            picCanvas.addImage(pic);
                        }
                    }
                }
            }
            if (i + 1 < pageList.size()) {
                document.setPageSize(new Rectangle(pageList.get(i + 1).getWidth(), pageList.get(i + 1).getHeight()));
            } else {
                document.setPageSize(new Rectangle(pageList.get(i).getWidth(), pageList.get(i).getHeight()));
            }
            document.newPage();
        }
    }
    document.close();

I updated the question because the language wasn't clear. You also made some false allegations: content won't get lost if the text render mode is changed, and it doesn't make a difference whether your text is in Chinese. I kept Option 1, but I don't see how it makes sense (to me, it's identical to Option 2). I would have expected an option involving optional content (although that may not help you much).
– Bruno Lowagie, Commented Feb 12, 2014 at 7:39
Forget about it. I've just looked at your PDF. You can't remove the text: the text is an image!
– Bruno Lowagie, Commented Feb 12, 2014 at 7:42
@BrunoLowagie Thanks a lot for your help! Excuse me, my English is poor. The lower layer is an image and the upper layer is text: two layers in total.
– andlu, Commented Feb 12, 2014 at 7:53
Yes, and you are asking to remove the text from the image, right? That's not possible with PDF software; you need image software to do that.
– Bruno Lowagie, Commented Feb 12, 2014 at 8:05
I've updated my answer. I've extracted the image and I've pasted it in my answer. That image is what you call the "image layer".
– Bruno Lowagie, Commented Feb 12, 2014 at 8:09

Bruno Lowagie · Accepted Answer · 2014-02-12 10:42:10Z

I've taken a look at your PDF and I see that the PDF is a scanned image. The text isn't really text: it is part of an image. Your question is invalid because it assumes that the text consists of vector data (defined using PDF syntax, such as BT and ET). In reality, the text is a bunch of pixels, and no pixel knows whether it belongs to a text glyph or to an image. In short: you're using the wrong approach. You are trying to solve a problem using PDF software, whereas you should be using a tool that manipulates raster images.

This is the image I extracted from the PDF:

[Image: the scanned page image extracted from the PDF]

The OP claims that there are two layers: one with an image, one with text. That may very well be true, but the image also contains rasterized text and it is impossible to remove that text from the image by changing the PDF syntax.

You may be able to cover the text if you know the coordinates, but that will largely depend on the accuracy of the OCR operation.
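As a rough sketch of the coordinate step such a cover-up would need: OCR tools commonly report boxes in image pixels with the origin at the top-left, whereas PDF user space is measured in points (72 per inch) from the bottom-left. The helper below is hypothetical (`OcrBoxToPdfSketch` and `toPdfRect` are not part of iText) and assumes the page image fills the whole page, that the page size in points corresponds to the scan at the given DPI, and that the default user unit applies.

```java
class OcrBoxToPdfSketch {

    /**
     * Converts a pixel box with a top-left origin into a PDF rectangle.
     * Returns {x, y, width, height} in points, where (x, y) is the
     * lower-left corner measured from the bottom-left of the page.
     */
    static double[] toPdfRect(double xPx, double yPx, double wPx, double hPx,
                              double dpi, double pageHeightPt) {
        double x = xPx * 72.0 / dpi;
        double width = wPx * 72.0 / dpi;
        double height = hPx * 72.0 / dpi;
        // Flip the vertical axis: the bottom edge of the box is (yPx + hPx) pixels from the top.
        double y = pageHeightPt - (yPx + hPx) * 72.0 / dpi;
        return new double[] { x, y, width, height };
    }

    public static void main(String[] args) {
        // Hypothetical box on a 300 dpi scan of a page that is 792 points high.
        double[] r = toPdfRect(300, 600, 150, 75, 300, 792);
        System.out.println(r[0] + " " + r[1] + " " + r[2] + " " + r[3]);
    }
}
```

The resulting rectangle could then be filled with an opaque color on the page's over-content; how well it hides the glyphs still depends on how accurate the OCR boxes are.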

If your requirement is not to cover the text in the image, but the text of the vector layer, it's sufficient to add the syntax that adds the image after the syntax that adds the vector text. If the image is opaque, it will cover all the text. This is done in the RepeatImage example:

    PdfReader reader = new PdfReader(src);
    // We assume that there's a single large picture on the first page
    PdfDictionary page = reader.getPageN(1);
    PdfDictionary resources = page.getAsDict(PdfName.RESOURCES);
    PdfDictionary xobjects = resources.getAsDict(PdfName.XOBJECT);
    PdfName imgName = xobjects.getKeys().iterator().next();
    Image img = Image.getInstance((PRIndirectReference) xobjects.getAsIndirectObject(imgName));
    img.setAbsolutePosition(0, 0);
    img.scaleAbsolute(reader.getPageSize(1));
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.getOverContent(1).addImage(img);
    stamper.close();
    reader.close();

Take a look at the resulting PDF; now you can still select the vector text, but it's no longer visible.

I updated my question and added the code that generates the PDF (the example PDF file).
While I upvoted (good analysis, a solution for the document at hand using only the higher-level iText APIs, no low-level stream editing), it would have been more in the spirit of SO if the solution code were (also) here in the answer, not only reachable via a link.
iText isn't donation-ware, but we are looking for ways to gain traction on the Chinese market. This isn't a discussion for StackOverflow though. It's better to move this conversation to sales.
