
I have a series of 9 URLs that I would like to scrape data from:

http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0 

The offset= parameter at the end of the link runs from 0 up to 900 (in steps of 100) as the pages change through to the last page. I would like to loop through each page, scrape each table, and then use rbind to stack the data frames on top of one another in sequence. I have been using rvest and would like to use lapply, since I am better with that than with for loops.
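For reference, a minimal sketch of how those offsets could be generated and appended to the link (base R; base_url is a hypothetical stand-in for everything in the link above up through "offset="):

    # Offsets 0, 100, ..., 900 -- one per results page
    offsets <- seq(0, 900, by = 100)

    # base_url is assumed to hold the draft_finder link up through "offset="
    urls <- paste0(base_url, offsets)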

The question is similar to this one (Harvest (rvest) multiple HTML pages from a list of urls), but different because I would prefer not to have to copy all the links into one vector before running the program. I would like a general solution for looping over multiple pages and harvesting the data, creating a data frame each time.

The following works for the first page:

    library(rvest)
    library(stringr)
    library(tidyr)

    site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0'

    webpage <- read_html(site)
    draft_table <- html_nodes(webpage, 'table')
    draft <- html_table(draft_table)[[1]]

But I would like to repeat this over all pages without having to paste the URLs into a vector. I tried the following and it didn't work:

    jump <- seq(0, 900, by = 100)

    site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=', jump, '.htm', sep = "")

    webpage <- read_html(site)
    draft_table <- html_nodes(webpage, 'table')
    draft <- html_table(draft_table)[[1]]

So there should be a data frame for each page, and I imagine it would be easiest to put them in a list and then use rbind to stack them.
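For that stacking step, base R makes it a one-liner; the sketch below assumes dfs is a hypothetical list of data frames, one per page, all with the same columns:

    # Stack a list of data frames row-wise (requires identical column names)
    combined <- do.call(rbind, dfs)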

Any help would be greatly appreciated!

  • do you manage to harvest even the first page? Commented Nov 17, 2016 at 22:55
  • @HubertL yes just edited the question above. The first chunk of code produces one data frame Commented Nov 17, 2016 at 23:03
  • 2
    Here is another potential solution: stackoverflow.com/questions/39129125/… Commented Nov 17, 2016 at 23:25
  • In the second version, site is a vector of URLs, so this is a dupe. Commented Nov 18, 2016 at 0:03
  • I would try download.file with mode="a" and then read all the data from a single disk file. Commented Nov 18, 2016 at 1:05

2 Answers


You are attempting to vectorize a method that cannot take multiple items in one call. Specifically, read_html() expects a single, scalar URL per call, since it reads one page of web data at a time. Consider looping through the vector of sites with lapply, then binding all of the data frames together:

    jump <- seq(0, 800, by = 100)

    # Build one URL per results page by varying the offset
    site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
                  'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
                  '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
                  '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
                  '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
                  '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
                  '&order_by_asc=&offset=', jump, sep = "")

    # Scrape the first table from each page into a list of data frames
    dfList <- lapply(site, function(i) {
      webpage <- read_html(i)
      draft_table <- html_nodes(webpage, 'table')
      draft <- html_table(draft_table)[[1]]
    })

    finaldf <- do.call(rbind, dfList)      # ASSUMING ALL DFs MAINTAIN SAME COLS

7 Comments

this is really intuitive but I keep getting an error "subscript is out of bounds" when I run it.
At what line do you receive error? Does dfList populate?
it appears the error comes from the final line of dfList. dfList does not populate
Just checked. Turns out 900 has no table search results. Try leaving that number out. Highest rank is 831.
Does error persist with leaving out last url with 900?
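Following up on the comments above, here is a more defensive sketch of the lapply body (an addition, not part of the original answer): it checks whether html_table() actually found a table before indexing, so an offset with no results, such as 900, yields NULL instead of a subscript error, and rbind silently ignores the NULL entries:

    dfList <- lapply(site, function(i) {
      webpage <- read_html(i)
      draft_table <- html_nodes(webpage, 'table')
      tabs <- html_table(draft_table)
      # Guard: offsets past the last results page have no table at all
      if (length(tabs) == 0) NULL else tabs[[1]]
    })

    # rbind ignores NULL elements, so empty pages simply drop out
    finaldf <- do.call(rbind, dfList)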

You can use curl to run all of the requests at once. Be nice to sites that may have small servers, and don't blow them up. With this code you can use lapply at the end to clean up each table so you can stack them with do.call(rbind, AllOut), but I will leave that to you.

    library(rvest)
    library(stringr)
    library(tidyr)
    library(curl)

    OffSet <- seq(0, 900, by = 100)

    Sites <- paste0('http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=', OffSet)

    out <- list()

    # Callback that runs whenever a request completes successfully
    complete <- function(res){
      # cat("Request done! Status:", res$status, "\n")
      out <<- c(out, list(res))
    }

    # Queue one GET request per page
    for(i in seq_along(Sites)){
      curl_fetch_multi(
        Sites[i],
        done = complete,
        fail = print,
        handle = new_handle(customrequest = "GET")
      )
    }

    # Execute all queued requests
    multi_run()

    # Parse each response; pages with no table return NULL
    AllOut <- lapply(out, function(x){
      webpage <- read_html(x$content)
      draft_table <- html_nodes(webpage, 'table')
      Tab <- html_table(draft_table)
      if(length(Tab) == 0){
        NULL
      } else {
        Tab
      }
    })
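To finish the stacking that the answer leaves open, something along these lines should work (a sketch, assuming each non-empty page keeps its results in the first table; note that curl's multi interface can complete requests out of order, so you may want to reorder by URL before binding):

    # Keep only pages that actually returned a table
    AllOut <- Filter(Negate(is.null), AllOut)

    # Each element is a list of tables; take the first and stack them
    finaldf <- do.call(rbind, lapply(AllOut, `[[`, 1))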

