Scraping tables on multiple web pages with rvest in R

Question

I am new to web scraping and am trying to scrape tables on multiple web pages. Here is the site: http://www.baseball-reference.com/teams/MIL/2016.shtml

I am able to scrape a table on one page rather easily using rvest. There are multiple tables, but I only want to scrape the first one. Here is my code:

    library(rvest)

    url4 <- "http://www.baseball-reference.com/teams/MIL/2016.shtml"

    Brewers2016 <- url4 %>%
      read_html() %>%
      html_nodes(xpath = '//*[@id="div_team_batting"]/table[1]') %>%
      html_table()

    Brewers2016 <- as.data.frame(Brewers2016)

The problem is that I want to scrape the first table on the page for every year dating back to 1970. There is a link to the previous year at the top left corner, just above the table. Does anybody know how I can do this?

I am also open to different ways of doing this, for example, a package other than rvest that might work better. I used rvest because it's the one I started learning.

I'm not going to sift through them to find the perfect dup, but there are multiple answers to this if you had simply searched stackoverflow.com/…
– hrbrmstr, Commented Oct 19, 2016 at 20:10

JasonAizkalns · Accepted Answer · 2016-10-19 19:51:23Z

One way would be to make a vector of all the URLs you are interested in and then use sapply:

    library(rvest)

    years <- 1970:2016
    urls <- paste0("http://www.baseball-reference.com/teams/MIL/", years, ".shtml")
    # head(urls)

    get_table <- function(url) {
      url %>%
        read_html() %>%
        html_nodes(xpath = '//*[@id="div_team_batting"]/table[1]') %>%
        html_table()
    }

    results <- sapply(urls, get_table)

results should be a list of 47 data.frame objects; each should be named with the URL (and therefore the year) it represents. That is, results[[1]] corresponds to 1970, and results[[47]] corresponds to 2016.
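
If some of the 47 pages fail to load, or a page lacks the table, the sapply call above stops at the first error. The following is a minimal sketch of a more defensive variant, not part of the original answer. The helper names get_table_safely and scrape_years, the fetch argument, and the one-second pause are all assumptions made for illustration. It uses lapply so the result is always a list, and names each element by year rather than by URL.

```r
# Hypothetical helper (not from the answer): fetch the first batting table
# for one URL, returning NULL instead of stopping if anything goes wrong.
# rvest:: prefixes are used so the sketch does not rely on library(rvest).
get_table_safely <- function(url, pause = 1) {
  Sys.sleep(pause)  # assumed courtesy delay between requests, in seconds
  tryCatch({
    page <- rvest::read_html(url)
    nodes <- rvest::html_nodes(page, xpath = '//*[@id="div_team_batting"]/table[1]')
    tables <- rvest::html_table(nodes)
    if (length(tables) == 0) NULL else tables[[1]]
  }, error = function(e) NULL)
}

# Hypothetical wrapper: scrape every year and name each element by its year.
# `fetch` is a parameter so the download step can be swapped out.
scrape_years <- function(years, fetch = get_table_safely) {
  urls <- paste0("http://www.baseball-reference.com/teams/MIL/", years, ".shtml")
  results <- lapply(urls, fetch)       # lapply always returns a list
  names(results) <- as.character(years)
  Filter(Negate(is.null), results)     # drop years that could not be scraped
}

# results <- scrape_years(1970:2016)
# results[["1970"]]   # the 1970 table, if that page was scraped successfully
```

With this layout, names(results) shows which years were actually retrieved, so any missing seasons can be retried separately.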
