Loop through URLs with R

Question

I need to download a series of Excel files from URLs that all look as follows:

    http://example.com/orResultsED.cfm?MODE=exED&ED=01&EventId=31
    http://example.com/orResultsED.cfm?MODE=exED&ED=02&EventId=31
    ...
    http://example.com/orResultsED.cfm?MODE=exED&ED=87&EventId=31

I've got some of the building blocks inside the loop, such as:

    for (i in 1:87) {
      url <- paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", i, "&EventId=31")
      file <- paste0("Data/myExcel_", i, ".xlsx")
      if (!file.exists(file)) download.file(url, file)
    }

My problems:

I need the seq to prepend the 0 (I tried sprintf with no luck)
I also want to import the Excel files, skip the first two rows and append them one after the other (they all have the same columns)

Update

@akrun's solution works well. But it turns out not all my Excel files have the same number of columns:

    map(files, ~read.xlsx(.x, colNames = FALSE, sheet = 1, startRow = 4)) %>% bind_rows

    Error in bind_rows_(x, .id) :
      Column `X1` can't be converted from numeric to character

I think this error actually points to the unequal number of columns. I tried adding fill = NA (when testing map_df()), but it didn't help.

FWIW, provided you're not trying to make $, "elections alberta" actually goes out of its way to inform you they allow personal and educational scraping. Please try to be kind and add in a Sys.sleep(5) tho. There's no need to download files faster than that. There's also a really good chance one email could end up giving you a ZIP file or even a set of SQL to load these up w/o scraping. But, hey, nice try with example.com.
– hrbrmstr, Commented Jan 29, 2018 at 1:30
Thanks! I'm pretty new to web scraping. I didn't know one could ask for SQL access. The example.com was more to shorten the link and make it fit in one line in SO than anything else!
– Thomas Speidel, Commented Jan 29, 2018 at 1:41
Ohh and I learned the if (!file.exists(file)) download.file(url, file) from your nuclear animation example! Thanks for chiming in!
– Thomas Speidel, Commented Jan 29, 2018 at 1:45

akrun · Accepted Answer · 2018-01-29 05:45:39Z

We can create the zero-padded number with sprintf:

    paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", sprintf("%02d", 1), "&EventId=31")
    #[1] "http://example.com/orResultsED.cfm?MODE=exED&ED=01&EventId=31"

In the loop:

    for (i in 1:87) {
      i1 <- sprintf('%02d', i)
      url <- paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", i1, "&EventId=31")
      file <- paste0("Data/myExcel_", i, ".xlsx")
      if (!file.exists(file)) download.file(url, file)
    }
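Because sprintf is vectorised, the padding can also be applied to every ID at once. The following is only a sketch: the names base_fmt and ids are made up for illustration, and the URL pattern is the one from the question.

```r
# Hypothetical names (base_fmt, ids); the URL pattern is taken from the question.
base_fmt <- "http://example.com/orResultsED.cfm?MODE=exED&ED=%02d&EventId=31"
ids <- 1:87

# sprintf() recycles the format string over the vector of IDs,
# so each call returns 87 strings with a zero-padded, two-digit number.
urls  <- sprintf(base_fmt, ids)
files <- sprintf("Data/myExcel_%02d.xlsx", ids)

# Inspect the first and last URL
urls[c(1, 87)]
```

Note that this sketch zero-pads the file names as well, which keeps them in numeric order when sorted alphabetically; the loop above uses the unpadded i instead.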

Assuming that the files are downloaded into the working directory (the loop above saves them under Data/, in which case list.files needs path = "Data"):

    files <- list.files(full.names = TRUE)
    library(openxlsx)
    library(purrr)
    library(dplyr)
    map(files, ~read.xlsx(.x, sheet = 1, startRow = 3)) %>% bind_rows

Or, as @hrbrmstr mentioned in the comments, map_df can be used, which returns a single dataset:

map_df(files, ~read.xlsx(.x, sheet = 1, startRow = 3))

Update

Based on the comments from OP, there seems to be a difference in column class for some of the datasets. In that case, bind_rows gives an error. One option is to use rbindlist from data.table

map(files, ~read.xlsx(.x, sheet = 1, startRow = 3)) %>% data.table::rbindlist(fill = TRUE)
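Another way to sidestep the class mismatch, sketched here as an assumption rather than something from the thread, is to coerce every column to character before binding and let type.convert guess the classes afterwards. The helper to_character and the toy data frames are hypothetical stand-ins for the imported Excel files.

```r
library(dplyr)

# Toy stand-ins for two imported files: column B differs in class,
# and df2 has an extra column C.
df1 <- data.frame(A = 1:5, B = as.character(6:10), stringsAsFactors = FALSE)
df2 <- data.frame(A = 7:12, B = 8:13, C = letters[1:6], stringsAsFactors = FALSE)

# Hypothetical helper: turn every column into character, keeping the data frame shape
to_character <- function(df) {
  df[] <- lapply(df, as.character)
  df
}

# With consistent classes bind_rows() no longer errors;
# columns missing from one input are filled with NA.
bound <- bind_rows(lapply(list(df1, df2), to_character))

# Let R re-guess the column classes on the combined data
combined <- type.convert(bound, as.is = TRUE)
str(combined)
```

With real files, the list passed to lapply would be the result of the map(files, ...) call above.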

if you're already using purrr why not map_df? (serious q … curious if you've noticed performance issues or odd behaviour/etc).
@hrbrmstr I thought about that. But if there is an unequal number of columns it could break, while bind_rows has the option to fill
hrm. map_df() has always filled for me, too, but I also try to only use it with list output any more (it's noticeably faster than data frames, esp if returning a lot of small ones)
Thanks! This almost works. See update to post... yes I have unequal no. of col.
@ThomasSpeidel I think it is an issue with the column class difference in one of the list elements. For example:

    df1 <- data.frame(A = 1:5, B = as.character(6:10), stringsAsFactors = FALSE)
    df2 <- data.frame(A = 7:12, B = 8:13)
    bind_rows(df1, df2)
    # Error in bind_rows_(x, .id) : Column B can't be converted from character to integer

You can use rbindlist from data.table, i.e. rbindlist(list(df1, df2), fill = TRUE)

chinsoon12 · Accepted Answer · 2018-01-29 01:35:25Z

Downloading and reading in one loop. Hopefully the columns are aligned; if not, use something like plyr::rbind.fill instead of do.call(rbind, list).

    do.call(rbind, lapply(1:87, function(n) {
      url <- paste0("http://example.com/orResultsED.cfm?MODE=exED&ED=", sprintf("%02d", n), "&EventId=31")
      file <- paste0("Data/myExcel_", n, ".xlsx")
      if (!file.exists(file)) {
        download.file(url, file)
        Sys.sleep(5)  # pause only after an actual download
      }
      # the data frame must be the last expression so that lapply() returns it
      readxl::read_excel(file, skip = 2)
    }))

Onyambu · Accepted Answer · 2018-01-29 01:54:15Z

You can also use regmatches:

    num = sprintf("%02.0f", 1:87)
    urls = rep("http://example.com/orResultsED.cfm?MODE=exED&ED=01&EventId=31", 87)
    regmatches(urls, regexpr("\\d+", urls)) <- num
    urls[87]
    # [1] "http://example.com/orResultsED.cfm?MODE=exED&ED=87&EventId=31"

To have all the files:

 files <- paste0("Data/myExcel_",num , ".xlsx")

To download the files:

    mapply(function(x, y) if (!file.exists(x)) download.file(y, x), files, urls)
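The download step can also be wrapped in a small function that skips existing files and pauses between requests, as suggested in the comments on the question. This is a sketch under assumptions: download_missing, pause and downloader are hypothetical names, and mode = "wb" is added because download.file can otherwise corrupt binary files such as .xlsx on Windows.

```r
# Hypothetical helper: download each URL to its file unless the file already exists.
# 'downloader' defaults to utils::download.file and can be swapped out for testing.
download_missing <- function(urls, files, pause = 5,
                             downloader = utils::download.file) {
  stopifnot(length(urls) == length(files))
  for (k in seq_along(urls)) {
    if (file.exists(files[k])) next           # already downloaded, skip
    downloader(urls[k], files[k], mode = "wb") # binary mode for Excel files
    Sys.sleep(pause)                           # be kind to the server
  }
  invisible(files)
}

# Usage with the vectors built above (not run here):
# download_missing(urls, files)
```

Pausing only after an actual download means that rerunning the script over an already complete Data folder finishes immediately.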
