I need to download a series of Excel files from URLs that all look as follows:

http://example.com/orResultsED.cfm?MODE=exED&ED=01&EventId=31
http://example.com/orResultsED.cfm?MODE=exED&ED=02&EventId=31
...
http://example.com/orResultsED.cfm?MODE=exED&ED=87&EventId=31

 

I've got some of the building blocks inside the loop, such as:

for (i in 1:87) {
  url <- paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", i, "&EventId=31")
  file <- paste0("Data/myExcel_", i, ".xlsx")
  if (!file.exists(file)) download.file(url, file)
}

 

My problems:

  1. I need the sequence number to be zero-padded, e.g. 01 instead of 1 (I tried sprintf with no luck)
  2. I also want to import the Excel files, skip the first two rows, and append them one after the other (they all have the same columns)

 

Update

@akrun's solution works well. But it turns out not all my Excel files have the same number of columns:

map(files, ~read.xlsx(.x, colNames = FALSE, sheet = 1, startRow = 4)) %>% bind_rows
Error in bind_rows_(x, .id) :
  Column `X1` can't be converted from numeric to character

I think this error actually points to the unequal number of columns. I tried adding fill = NA (when testing map_df()), but it didn't help.

  • FWIW, provided you're not trying to make $, "elections alberta" actually goes out of its way to inform you they allow personal and educational scraping. Please try to be kind and add in a Sys.sleep(5) tho. There's no need to download files faster than that. There's also a really good chance one email could end up giving you a ZIP file or even a set of SQL to load these up w/o scraping. But, hey, nice try with example.com. Commented Jan 29, 2018 at 1:30
  • Thanks! I'm pretty new to web scraping. I didn't know one could ask for SQL access. The example.com was more to shorten the link and make it fit in one line in SO than anything else! Commented Jan 29, 2018 at 1:41
  • Ohh and I learned the if (!file.exists(file)) download.file(url, file) from your nuclear animation example! Thanks for chiming in! Commented Jan 29, 2018 at 1:45

3 Answers

We can create it with sprintf

paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", sprintf("%02d", 1), "&EventId=31")
#[1] "http://example.com/orResultsED.cfm?MODE=exED&ED=01&EventId=31"
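
Since sprintf is vectorised, the padded IDs, URLs, and file names can also be built in one go, without a loop. This is only a minimal sketch: the names ids, urls, and files are mine, and padding the file names as well (myExcel_01.xlsx rather than myExcel_1.xlsx) is an assumption that differs from the question's loop.

```r
# Zero-padded IDs "01", "02", ..., "87" in a single vectorised call
ids <- sprintf("%02d", 1:87)

# paste0 recycles the fixed parts of the URL across all 87 IDs
urls <- paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", ids, "&EventId=31")

# Assumption: padded file names, so they sort in the same order as the IDs
files <- file.path("Data", paste0("myExcel_", ids, ".xlsx"))

head(urls, 2)
```

Padding the file names is optional, but it keeps list.files() output in numeric order, which may matter if the row order of the combined data is important.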

In the loop,

for (i in 1:87) {
  i1 <- sprintf('%02d', i)
  url <- paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", i1, "&EventId=31")
  file <- paste0("Data/myExcel_", i, ".xlsx")
  if (!file.exists(file)) download.file(url, file)
}

Assuming that the files are downloaded in the working directory

files <- list.files(full.names = TRUE)
library(openxlsx)
library(purrr)
library(dplyr)
map(files, ~read.xlsx(.x, sheet = 1, startRow = 3)) %>% bind_rows

Or, as @hrbrmstr mentioned in the comments, map_df can be used, which returns a single dataset

map_df(files, ~read.xlsx(.x, sheet = 1, startRow = 3)) 

Update

Based on the comments from OP, there seems to be a difference in column class for some of the datasets. In that case, bind_rows gives an error. One option is to use rbindlist from data.table

map(files, ~read.xlsx(.x, sheet = 1, startRow = 3)) %>% data.table::rbindlist(fill = TRUE) 
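
If pulling in data.table is not an option, roughly the same idea can be sketched in base R: convert every column to character so the classes cannot clash, pad missing columns with NA, then rbind. The helpers to_character and bind_fill are hypothetical names of my own, and the two toy data frames only stand in for the Excel sheets. Note that, unlike rbindlist, this leaves every column as character.

```r
# Toy stand-ins for two sheets with clashing classes and unequal columns
df1 <- data.frame(A = 1:2, B = c("x", "y"), stringsAsFactors = FALSE)
df2 <- data.frame(A = 3:4, B = 5:6, C = c(TRUE, FALSE))

# Hypothetical helper: convert every column to character, keeping the data frame
to_character <- function(df) {
  df[] <- lapply(df, as.character)
  df
}

# Hypothetical helper: pad each data frame to the union of columns, then stack
bind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    df <- to_character(df)
    for (col in setdiff(all_cols, names(df))) df[[col]] <- NA_character_
    df[all_cols]  # same column order everywhere, so rbind lines up
  })
  do.call(rbind, padded)
}

combined <- bind_fill(list(df1, df2))
combined
```

Columns that should be numeric would still need converting back afterwards, once the offending values have been inspected.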

8 Comments

if you're already using purrr why not map_df? (serious q … curious if you've noticed performance issues or odd behaviour/etc).
@hrbrmstr I thought about that. But if there is an unequal number of columns it could break, whereas bind_rows has the option to fill
hrm. map_df() has always filled for me, too, but I also try to only use it with list output any more (it's noticeably faster than data frames, esp if returning a lot of small ones)
Thanks! This almost works. See update to post... yes I have unequal no. of col.
@ThomasSpeidel I think it is an issue with a column class difference in one of the list elements. E.g. df1 <- data.frame(A = 1:5, B = as.character(6:10), stringsAsFactors = FALSE); df2 <- data.frame(A = 7:12, B = 8:13); bind_rows(df1, df2) # Error in bind_rows_(x, .id) : Column B can't be converted from character to integer. You can use rbindlist from data.table, i.e. rbindlist(list(df1, df2), fill = TRUE)

Downloading and reading in one loop. Hopefully the columns are aligned; if not, use something like plyr::rbind.fill instead of do.call(rbind, list).

do.call(rbind, lapply(1:87, function(n) {
  url <- paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", sprintf("%02d", n), "&EventId=31")
  file <- paste0("Data/myExcel_", n, ".xlsx")
  if (!file.exists(file)) {
    download.file(url, file)
    Sys.sleep(5)  # pause after each download to be kind to the server
  }
  # The data frame must be the last expression so that lapply() collects it
  readxl::read_excel(file, skip = 2)
}))
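
With 87 requests, one failed download would abort the whole loop, so it may be worth wrapping the download step. The following is only a sketch under my own assumptions: fetch_if_missing is a hypothetical helper, the downloader argument exists purely so the function can be tried without network access, and mode = "wb" is passed because binary files such as .xlsx can otherwise get corrupted on Windows.

```r
# Hypothetical helper: download a file unless it is already on disk.
# Returns "cached", "downloaded", or "failed" instead of stopping on error.
fetch_if_missing <- function(url, file, pause = 5, downloader = download.file) {
  if (file.exists(file)) return("cached")
  status <- tryCatch({
    downloader(url, file, mode = "wb")  # binary mode for Excel files
    "downloaded"
  }, error = function(e) "failed")
  Sys.sleep(pause)  # pause after every attempt, successful or not
  status
}

# Possible usage, assuming character vectors urls and files of equal length:
# statuses <- mapply(fetch_if_missing, urls, files)
# table(statuses)
```

The returned statuses make it easy to see which districts need a retry before reading the files in.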


You can also use regmatches:

num <- sprintf("%02.0f", 1:87)
urls <- rep("http://example.com/orResultsED.cfm?MODE=exED&ED=01&EventId=31", 87)
regmatches(urls, regexpr("\\d+", urls)) <- num
urls[87]
[1] "http://example.com/orResultsED.cfm?MODE=exED&ED=87&EventId=31"

To have all the files:

 files <- paste0("Data/myExcel_",num , ".xlsx") 

To download the files:

 mapply(function(x, y) if (!file.exists(x)) download.file(y, x), files, urls)
