
I have been using the XML package successfully for extracting HTML tables, but I want to extend this to PDFs. From previous questions it does not appear that there is a simple R solution, but I wondered if there have been any recent developments.

Failing that, is there some way in Python (in which I am a complete novice) to obtain and manipulate PDFs so that I could finish the job off with the R XML package?

4 Answers


Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext is installable on any Linux system...
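As a starting point, here is a minimal sketch of driving pdftotext from within R (this assumes the pdftotext utility from poppler-utils is on the PATH; report.pdf is a hypothetical file name):

```r
# run pdftotext with -layout, which tries to preserve the physical
# arrangement of the page -- helpful when the text forms a table
system2("pdftotext", args = c("-layout", "report.pdf", "report.txt"))

# read the extracted text back into R for parsing
txt <- readLines("report.txt")
```

The -layout flag is usually the right choice for tables, since the default reading-order extraction can interleave columns.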


3 Comments

Seconded. Doing it in R isn't worth anyone's effort to develop and maintain when there are far better-maintained options outside of R. If you need to process a lot of files, try the find utility in Unix (or in the GNU collection for Windows), or have R send commands to the shell, looping over filenames... Even Adobe had a terrible text extractor for a long time (not sure if it's better now), while Xerox had a good one.
Well, pdftotext works fine at producing a clean text page, but it's not in any form that easily creates what I want. Thanks anyway.
Running pdftotext isn't brilliant on that page, but converting to ps first or just running ps2txt on the PDF produces an almost perfect table with some page heads/foots to remove.
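The batch approach mentioned in the comments — having R loop over filenames and send commands to the shell — can be sketched like this (assuming pdftotext is installed; the pdfs/ directory is hypothetical):

```r
# convert every PDF in a directory to text by shelling out to pdftotext
pdfs <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
for (f in pdfs) {
  out <- sub("\\.pdf$", ".txt", f)
  system2("pdftotext", args = c("-layout", f, out))
}
```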

You might want to check out the text mining package tm. I recall that it implements so-called readers, and there is one for PDFs.
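A hedged sketch of the tm approach (readPDF() is in the tm package; depending on the engine it shells out to an external tool such as the xpdf/poppler pdftotext utility, which must be installed separately; report.pdf is hypothetical):

```r
library(tm)

# readPDF() returns a reader function; by default it relies on an
# external pdftotext binary, so that must be on the PATH
reader <- readPDF(control = list(text = "-layout"))

# build a one-document corpus from the PDF and inspect its text
corp <- Corpus(URISource("report.pdf"),
               readerControl = list(reader = reader))
content(corp[[1]])
```

Note that this gives you the raw text of the document, not parsed tables — you would still need to split the lines into columns yourself.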

Comments


AFAIK there isn't an easy way of turning PDF tables into something useful for data analysis. You can use the Data Science Toolkit's File to Text utility (R interface via the RDSTK package), then parse the resulting text. Be warned: the parsing is often non-trivial.


EDIT: There's a useful discussion of converting PDFs to XML on discerning.com. The short answer is that you will probably need to buy a commercial tool.

1 Comment

+1 Thanks for that. I checked the discussion and tried downloading the ABBYY product on trial, but it would not set up properly. Guess I'm doomed.

The heart of the tabula application that can extract tables from PDF documents is available as a simple command line Java application, tabula-extractor.

This Java app has been wrapped in R by the tabulizer package. Pass it the path to a PDF file and it will try to extract data tables for you and return them as R data structures.
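A minimal sketch with tabulizer (assuming Java and the package are installed; report.pdf is hypothetical):

```r
library(tabulizer)

# extract_tables() runs the tabula-extractor Java code on the PDF and
# returns one element per table it detects
tabs <- extract_tables("report.pdf")

# coerce the first detected table to a data frame for analysis
df <- as.data.frame(tabs[[1]], stringsAsFactors = FALSE)
```

Since Tabula was built specifically for tables (rather than general text extraction), it often avoids the column-interleaving problems you get with plain pdftotext.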

For an example, see When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.

Comments
