Read PDF files using Java

Question

I want to parse PDF files hosted on websites.

Can anyone tell me how to extract all the words (word by word) from a PDF file using Java?

The code below extracts the content of a PDF file and writes it to another PDF file. I want the program to write it to a text file instead.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import com.itextpdf.text.*;
    import com.itextpdf.text.pdf.*;

    public class pdf {
        private static String INPUTFILE = "http://www.britishcouncil.org/learning-infosheets-medicine.pdf";
        private static String OUTPUTFILE = "c:/new3.pdf";

        public static void main(String[] args) throws DocumentException, IOException {
            Document document = new Document();
            PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(OUTPUTFILE));
            document.open();
            PdfReader reader = new PdfReader(INPUTFILE);
            int n = reader.getNumberOfPages();
            PdfImportedPage page;
            for (int i = 1; i <= n; i++) {
                page = writer.getImportedPage(reader, i);
                Image instance = Image.getInstance(page);
                document.add(instance);
            }
            document.close();
        }
    }

Thanks in advance

possible duplicate of How to read PDF files using java

Commented Mar 12, 2015 at 13:36 — Travis

Leniel Maccaferri · Accepted Answer · 2010-10-25 14:26:05Z


Take a look at this:

How to Read PDF File in Java (uses Apache PDF Box library)


dina · Accepted Answer · 2017-02-15 20:37:25Z

using org.apache.pdfbox

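    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.pdfbox.io.IOUtils;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // Assumes PDFBox 2.x package names; place these methods inside a class.
    public static String convertPDFToTxt(String filePath) throws IOException {
        byte[] thePDFFileBytes = readFileAsBytes(filePath);
        // try-with-resources closes the document even if extraction fails
        try (PDDocument pddDoc = PDDocument.load(thePDFFileBytes)) {
            PDFTextStripper reader = new PDFTextStripper();
            return reader.getText(pddDoc);
        }
    }

    private static byte[] readFileAsBytes(String filePath) throws IOException {
        // Close the stream once all bytes have been read
        try (FileInputStream inputStream = new FileInputStream(filePath)) {
            return IOUtils.toByteArray(inputStream);
        }
    }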
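To get from the extracted string to a word-by-word text file, as the question asks, something along these lines might work. It is only a sketch using the standard library: the class name WordFileWriter, both method names, and the output file words.txt are made up for illustration, and it assumes the text has already been extracted (for example by convertPDFToTxt above). Splitting on whitespace is a simplification, so punctuation stays attached to the words.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or iText.
public class WordFileWriter {

    // Splits text on runs of whitespace; punctuation stays attached to words.
    public static List<String> splitIntoWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) { // skips the single empty token produced by blank input
                words.add(token);
            }
        }
        return words;
    }

    // Writes one word per line to the given file, encoded as UTF-8.
    public static void writeWords(List<String> words, Path output) throws IOException {
        Files.write(output, words, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder input; in practice this would be the extracted PDF text.
        String extracted = "Text returned by a PDF text extractor";
        writeWords(splitIntoWords(extracted), Paths.get("words.txt"));
    }
}
```

If the words need to be cleaned further (lower-casing, stripping punctuation), that can be done on each token inside splitIntoWords before it is added to the list.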

Can I read a PDF file partially, for example only the first page, or up to a certain text occurrence, rather than reading the whole file? That way I could avoid downloading the entire file.
