
I have been using the XML package successfully for extracting HTML tables, but I want to extend this to PDFs. From previous questions it does not appear that there is a simple R solution, but I wondered if there have been any recent developments.

Failing that, is there some way in Python (in which I am a complete novice) to obtain and manipulate PDFs so that I could finish the job off with the R XML package?

4 Answers


Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext is installable on any Linux system...
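As a starting point, here is a minimal sketch of driving pdftotext from within R (this assumes the pdftotext utility from poppler-utils is on the PATH; report.pdf is a hypothetical file name):

```r
# run pdftotext with -layout, which tries to preserve the physical
# arrangement of the page -- helpful when the text forms a table
system2("pdftotext", args = c("-layout", "report.pdf", "report.txt"))

# read the extracted text back into R for parsing
txt <- readLines("report.txt")
```

The -layout flag is usually the right choice for tables, since the default reading-order extraction can interleave columns.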


3 Comments

Seconded. Doing it in R isn't worth anyone's effort to develop and maintain when there are far better-maintained options outside of R. If you need to process a lot of files, try the find utility in Unix (or in the GNU collection for Windows), or have R send commands to the shell, looping over filenames... Even Adobe had a terrible text extractor for a long time (not sure if it's better now), while Xerox had a good one.
Well, pdftotext works fine at producing a clean text page, but it's not in any form that easily creates what I want. Thanks anyway.
Running pdftotext isn't brilliant on that page, but converting to ps first or just running ps2txt on the PDF produces an almost perfect table with some page heads/foots to remove.
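The batch approach mentioned in the comments — having R loop over filenames and send commands to the shell — can be sketched like this (assuming pdftotext is installed; the pdfs/ directory is hypothetical):

```r
# convert every PDF in a directory to text by shelling out to pdftotext
pdfs <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
for (f in pdfs) {
  out <- sub("\\.pdf$", ".txt", f)
  system2("pdftotext", args = c("-layout", f, out))
}
```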

You might want to check out the text mining package tm. I recall that it implements so-called readers, and there is one for PDFs.
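A hedged sketch of the tm approach (readPDF() is in the tm package; depending on the engine it shells out to an external tool such as the xpdf/poppler pdftotext utility, which must be installed separately; report.pdf is hypothetical):

```r
library(tm)

# readPDF() returns a reader function; by default it relies on an
# external pdftotext binary, so that must be on the PATH
reader <- readPDF(control = list(text = "-layout"))

# build a one-document corpus from the PDF and inspect its text
corp <- Corpus(URISource("report.pdf"),
               readerControl = list(reader = reader))
content(corp[[1]])
```

Note that this gives you the raw text of the document, not parsed tables — you would still need to split the lines into columns yourself.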

Comments


AFAIK there isn't an easy way of turning PDF tables into something useful for data analysis. You can use the Data Science Toolkit's File to Text utility (R interface via the RDSTK package), then parse the resulting text. Be warned: the parsing is often non-trivial.


EDIT: There's a useful discussion of converting PDFs to XML on discerning.com. The short answer is that you will probably need to buy a commercial tool.

1 Comment

+1 Thanks for that. I checked the discussion and tried downloading the ABBYY product on trial, but it would not set up properly. Guess I'm doomed.

The heart of the tabula application that can extract tables from PDF documents is available as a simple command line Java application, tabula-extractor.

This Java app has been wrapped in R by the tabulizer package. Pass it the path to a PDF file and it will try to extract data tables for you and return them as R data structures.
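A minimal sketch with tabulizer (assuming Java and the package are installed; report.pdf is hypothetical):

```r
library(tabulizer)

# extract_tables() runs the tabula-extractor Java code on the PDF and
# returns one element per table it detects
tabs <- extract_tables("report.pdf")

# coerce the first detected table to a data frame for analysis
df <- as.data.frame(tabs[[1]], stringsAsFactors = FALSE)
```

Since Tabula was built specifically for tables (rather than general text extraction), it often avoids the column-interleaving problems you get with plain pdftotext.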

For an example, see When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.

Comments
