How to scrape a downloaded PDF file with R [duplicate]

Question

I’ve recently gotten into scraping (and programming in general) for my internship, and I came across PDF scraping. Every time I try to read a scanned pdf with R, I can never get it to work. I’ve tried using the file.choose() function to no avail. Do I need to change my directory, or how can I get the pdf from my files into R? The code looks something like this:

 > library(pdftools)
 > text=pdf_text("C:/Users/myname/Documents/renewalscan.pdf")
 > text
 [1] ""

Also, using pdftables leads me here:

 > library(pdftables)
 > convert_pdf("C:/Users/myname/Documents/renewalscan.pdf","my.csv")
 Error in get_content(input_file, format, api_key) : Bad Request (HTTP 400).

What are you trying to scrape it with? Is there non-image text in it to scrape, or is it in an image? This question isn't answerable without a reproducible example.
– alistaire, Commented Jun 7, 2018 at 20:36
Apologies, I’m using pdftools and tm, and was trying to follow along with what’s said in medium.com/@CharlesBordet/…’m-r-da11964e252e. Normally, a file is downloaded from the web, but I have the file already on my computer. Also, it is a table in pdf form.
– Thomas Campbell, Commented Jun 7, 2018 at 20:49
Similar thread here: stackoverflow.com/questions/51312453/…
– mphil4, Commented Feb 28, 2019 at 11:19

Giovana Stein · Accepted Answer · 2018-06-07 20:52:23Z

You should use the packages pdftools and pdftables.

If you are trying to read text inside the PDF, use the pdf_text() function. Its argument is the path to the PDF, either a file on your computer or a web address. For example:

tt = pdf_text("C:/Users/Smith/Documents/my_file.pdf")
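If pdf_text() gives back only empty strings, as in the question, the file may be a scan with no text layer, which is what the first comment is asking about. Here is a rough sketch of one way to handle that, assuming the tesseract package is installed so that pdftools' pdf_ocr_text() can run. read_pdf_text() is a made-up helper name for this example, not something pdftools provides.

```r
# Sketch: read text from a local PDF, falling back to OCR for scans.
# read_pdf_text() is a hypothetical helper, not part of pdftools.
read_pdf_text <- function(path) {
  # Fail early with a clear message if the path is wrong
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  # pdf_text() returns one string per page; pages without a text layer are ""
  pages <- pdftools::pdf_text(path)
  if (all(!nzchar(trimws(pages)))) {
    # No embedded text anywhere: probably a scanned image, so try OCR
    # (assumes the tesseract package is available)
    pages <- pdftools::pdf_ocr_text(path)
  }
  pages
}

# Usage, with the path from the question:
# text <- read_pdf_text("C:/Users/myname/Documents/renewalscan.pdf")
```

The file.exists() check also covers the "do I need to change my directory" part of the question: with a full path like the one above, the working directory should not matter.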

It would be nice if you were more specific and also gave us a reproducible example.

I apologize for the lack of clarity. This is my first post here, so I’m trying to get the hang of it all. I’m editing the post now to show my code.

mphil4 · Accepted Answer · 2019-03-29 07:33:37Z

To use the PDFTables R package, you need to run the following command:

convert_pdf('test/index.pdf', output_file = NULL, format = "xlsx-single", message = TRUE, api_key = "insert_API_key")
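To avoid pasting the key into a script, something along these lines might work. It is only a sketch: pdf_to_csv() and the PDFTABLES_API_KEY environment variable are names invented for this example, and the only package call used is convert_pdf() with the arguments shown above.

```r
# Sketch: convert a local PDF to CSV with pdftables, passing the API key
# explicitly. pdf_to_csv() and PDFTABLES_API_KEY are hypothetical names.
pdf_to_csv <- function(path, out = "my.csv",
                       api_key = Sys.getenv("PDFTABLES_API_KEY")) {
  # Check the inputs locally before making a web request
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  if (!nzchar(api_key)) {
    stop("No API key set; pass api_key or set PDFTABLES_API_KEY")
  }
  # Same call as above, with CSV output written to `out`
  pdftables::convert_pdf(path, output_file = out, format = "csv",
                         message = TRUE, api_key = api_key)
  # Return the output path so it can be read with read.csv() afterwards
  invisible(out)
}

# Usage, with the path from the question:
# csv_path <- pdf_to_csv("C:/Users/myname/Documents/renewalscan.pdf")
```

The convert_pdf() call in the question does not pass an API key, so that may be worth checking first, though the HTTP 400 could have other causes.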

Matt Dancho · Accepted Answer · 2019-09-24 15:51:43Z

If you are looking to get tabular data, you might try tabulizer. Here is a full code tutorial: https://www.business-science.io/code-tools/2019/09/23/tabulizer-pdf-scraping.html

Basically, you can use this code from the tutorial:

library(tabulizer)
extract_tables(
    file = "2019-09-23-tabulizer/endangered_species.pdf",
    method = "decide",
    output = "data.frame")

roberty boberty · Accepted Answer · 2024-08-12 16:14:11Z

export to word... copy and paste into excel... read into R... then go ahead and go to pdftables.com and inform them how ashamed they should be of themselves...
