
I’ve recently gotten into scraping (and programming in general) for my internship, and I came across PDF scraping. Every time I try to read a scanned PDF with R, I can never get it to work. I’ve tried using the file.choose() function to no avail. Do I need to change my working directory, or is there another way to get the PDF from my files into R? The code looks something like this:

 > library(pdftools)
 > text=pdf_text("C:/Users/myname/Documents/renewalscan.pdf")
 > text
 [1] ""

Also, using pdftables leads me here:

 > library(pdftables)
 > convert_pdf("C:/Users/myname/Documents/renewalscan.pdf","my.csv")
 Error in get_content(input_file, format, api_key) : Bad Request (HTTP 400).
  • What are you trying to scrape it with? Is there non-image text in it to scrape, or is it in an image? This question isn't answerable without a reproducible example. Commented Jun 7, 2018 at 20:36
  • Apologies, I’m using pdftools and tm, and was trying to follow along with what’s said in medium.com/@CharlesBordet/…’m-r-da11964e252e. Normally, a file is downloaded from the web, but I have the file already on my computer. Also, it is a table in pdf form. Commented Jun 7, 2018 at 20:49
  • Similar thread here: stackoverflow.com/questions/51312453/… Commented Feb 28, 2019 at 11:19

4 Answers


You should use the packages pdftools and pdftables.

If you are trying to read text inside the PDF, use the pdf_text() function. Its argument is the path to the PDF, either on your computer or on the web. For example:

tt = pdf_text("C:/Users/Smith/Documents/my_file.pdf") 

It would be nice if you were more specific and also gave us a reproducible example.
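
Since the question mentions a scanned PDF, an empty string from pdf_text() usually means the pages are images with no embedded text layer, rather than a wrong path (a wrong path raises an error instead). Here is a minimal sketch of that check. The helper name has_text_layer() is hypothetical, the path is the one from the question, and the OCR fallback assumes a pdftools version that provides pdf_ocr_text() together with an installed tesseract package.

```r
# Hypothetical helper: TRUE if any page contains non-whitespace text.
has_text_layer <- function(pages) {
  any(nzchar(trimws(pages)))
}

# Path taken from the question; adjust to your own file.
path <- "C:/Users/myname/Documents/renewalscan.pdf"

if (file.exists(path) && requireNamespace("pdftools", quietly = TRUE)) {
  pages <- pdftools::pdf_text(path)  # one character string per page
  if (!has_text_layer(pages) && requireNamespace("tesseract", quietly = TRUE)) {
    # Assumption: pdf_ocr_text() is available in the installed pdftools;
    # it runs OCR on the rendered page images instead of reading a text layer.
    pages <- pdftools::pdf_ocr_text(path)
  }
  cat(pages[1])
} else {
  message("File not found or pdftools not installed: ", path)
}
```

OCR output from a scanned table will generally need cleaning before it is usable as data, so treat this as a starting point only.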


1 Comment

I apologize for the lack of clarity. This is my first post here, so I’m trying to get the hang of it all. I’m editing the post now to show my code.

To use the PDFTables R package, you need to run the following command:

convert_pdf('test/index.pdf', output_file = NULL, format = "xlsx-single", message = TRUE, api_key = "insert_API_key") 
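
PDFTables is a web service, so the call depends on both a readable input file and an API key. A minimal sketch that checks both before calling convert_pdf() could look like this. The helper name ready_to_convert() and the environment variable PDFTABLES_API_KEY are hypothetical, and the file names are the ones from the question.

```r
# Hypothetical helper: check the file and the key before calling the web API.
ready_to_convert <- function(path, key) {
  file.exists(path) && nzchar(key)
}

input <- "C:/Users/myname/Documents/renewalscan.pdf"  # path from the question
key <- Sys.getenv("PDFTABLES_API_KEY")  # hypothetical variable name; "" if unset

if (ready_to_convert(input, key) && requireNamespace("pdftables", quietly = TRUE)) {
  # Same call as in the question, with the key passed explicitly.
  pdftables::convert_pdf(input, output_file = "my.csv", api_key = key)
} else {
  message("Check the file path, the API key, and that pdftables is installed.")
}
```

Keeping the key in an environment variable rather than in the script avoids pasting it into shared code.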


If you are looking to get tabular data, you might try tabulizer. Here is a full code tutorial: https://www.business-science.io/code-tools/2019/09/23/tabulizer-pdf-scraping.html

Basically, you can use this code from the tutorial:

 library(tabulizer)
 extract_tables(
   file = "2019-09-23-tabulizer/endangered_species.pdf",
   method = "decide",
   output = "data.frame")
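
With output = "data.frame", the result should be a list with one data frame per detected table, so a natural next step is to save each one. The following is a sketch under that assumption. The helper save_tables() and the table_NN.csv file naming are hypothetical, and the PDF path is the one from the tutorial.

```r
# Hypothetical helper: write each data frame in a list to its own CSV file
# and return the paths of the files written.
save_tables <- function(tables, dir = tempdir()) {
  paths <- file.path(dir, sprintf("table_%02d.csv", seq_along(tables)))
  for (i in seq_along(tables)) {
    write.csv(tables[[i]], paths[i], row.names = FALSE)
  }
  paths
}

pdf_file <- "2019-09-23-tabulizer/endangered_species.pdf"  # path from the tutorial

if (file.exists(pdf_file) && requireNamespace("tabulizer", quietly = TRUE)) {
  tables <- tabulizer::extract_tables(file = pdf_file, method = "decide",
                                      output = "data.frame")
  print(save_tables(tables))
}
```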


Export to Word... copy and paste into Excel... read into R... then go ahead and go to pdftables.com and inform them how ashamed they should be of themselves...
