rvest Web Scraping of Multiple URLs (hopefully easy question)

Question

I'm a total rookie web scraper, so apologies for the basic question, but I have searched around and struggled to apply previous answers on here. I am trying to scrape multiple related URLs on fbref.com (a subset of Sports Reference), but I'm running into an issue that I think comes down to using lapply properly. I can successfully pull one URL, just not all of them at once.

Here is the gist of what I'm trying to do:

    library("rvest")
    library("tidyverse")

    year1 <- paste0(2006:2021)
    year2 <- paste0(2007:2022)
    urls <- sort(rep(paste0("https://fbref.com/en/comps/Big5/", year1, "-", year2,
                            "/stats/players/", year1, "-", year2,
                            "-Big-5-European-Leagues-Stats")))

    table <- read_html(urls) |>
      html_nodes("table") |>
      html_table()

I think I just need to loop that last section with lapply, but I am struggling to get the formatting right. When I use the last section to read ONE of the URLs by pasting in a single URL, like below, I get the output I want. I simply want this for all seasons from 2006-07 through 2021-22, in one csv file.

    > url <- "https://fbref.com/en/comps/Big5/2021-2022/stats/players/2021-2022-Big-5-European-Leagues-Stats"
    > table <- read_html(url) |>
    +   html_nodes("table") |>
    +   html_table()
    > write.csv(table, file = "fbrefinitial.csv")

From there, I think I just need to use bind_rows along with either year1 or year2 to add a column for each year, as I would like to get this all in one tab of one csv file. (What's the right way to format that command?)

This is most similar to this post, but my attempts to apply that logic in different ways are not working.

Thank you for your help!

Allan Cameron · Accepted Answer · 2023-03-11 20:17:50Z

You can do:

    lapply(urls, function(url) {
      read_html(url) |>
        html_nodes("table") |>
        html_table()
    })
    #> [[1]]
    #> [[1]][[1]]
    #> # A tibble: 2,687 x 29
    #> `` `` `` `` `` `` `` `` Playi~1 Playi~2 Playi~3 Playi~4
    #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
    #> 1 Rk Player Nati~ Pos Squad Comp Age Born MP Starts Min 90s
    #> 2 1 Dani Aba~ es E~ FW,MF Celt~ es L~ 18 1987 1 0 13 0.1
    #> 3 2 Jacques ~ fr F~ DF Nice fr L~ 28 1978 30 28 2,492 27.7
    #> 4 3 Christia~ it I~ GK Tori~ it S~ 29 1977 36 36 3,235 35.9
    #> 5 4 Pato Abb~ ar A~ GK Geta~ es L~ 33 1972 36 36 3,215 35.7
    #> 6 5 Elvis Ab~ it I~ FW Tori~ it S~ 25 1981 29 15 1,432 15.9
    #> 7 6 Nadjim A~ km C~ MF Sedan fr L~ 22 1984 17 11 1,136 12.6
    #> 8 7 Nelson A~ uy U~ MF Atal~ it S~ 33 1973 5 2 121 1.3
    #> 9 8 Mathias ~ de G~ DF Hamb~ de B~ 25 1981 8 4 416 4.6
    #> 10 9 Éric Abi~ fr F~ DF Lyon fr L~ 26 1979 33 31 2,750 30.6
    #> # ... with 2,677 more rows, 17 more variables: Performance <chr>, Performance <chr>,
    #> # Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
    #> # Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
    #> # Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
    #> # `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
    #> # and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
    #> # 3: `Playing Time`, 4: `Playing Time`
    #> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
    #>
    #>
    #> [[2]]
    #> [[2]][[1]]
    #> # A tibble: 2,770 x 29
    #> `` `` `` `` `` `` `` `` Playi~1 Playi~2 Playi~3 Playi~4
    #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
    #> 1 Rk Player Nati~ Pos Squad Comp Age Born MP Starts Min 90s
    #> 2 1 Jacques ~ fr F~ DF Nice fr L~ 29 1978 10 4 434 4.8
    #> 3 2 Jacques ~ fr F~ DF Nürn~ de B~ 29 1978 10 9 820 9.1
    #> 4 3 Ignazio ~ it I~ DF,MF Empo~ it S~ 20 1986 24 9 1,167 13.0
    #> 5 4 Christia~ it I~ GK Atlé~ es L~ 30 1977 21 20 1,804 20.0
    #> 6 5 Pato Abb~ ar A~ GK Geta~ es L~ 34 1972 34 34 3,046 33.8
    #> 7 6 Yacine A~ ma M~ MF Stra~ fr L~ 26 1981 23 17 1,549 17.2
    #> 8 7 Damià Ab~ es E~ DF,MF Betis es L~ 25 1982 26 24 2,230 24.8
    #> 9 8 Éric Abi~ fr F~ DF Barc~ es L~ 27 1979 30 28 2,523 28.0
    #> 10 9 Ahmed Ab~ eg E~ DF,MF Stra~ fr L~ 26 1981 2 1 91 1.0
    #> # ... with 2,760 more rows, 17 more variables: Performance <chr>, Performance <chr>,
    #> # Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
    #> # Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
    #> # Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
    #> # `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
    #> # and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
    #> # 3: `Playing Time`, 4: `Playing Time`
    #> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
    #>
    #>
    #> [[3]]
    #> [[3]][[1]]
    #> # A tibble: 2,796 x 29
    #> `` `` `` `` `` `` `` `` Playi~1 Playi~2 Playi~3 Playi~4
    #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
    #> 1 Rk Player Nati~ Pos Squad Comp Age Born MP Starts Min 90s
    #> 2 1 Jacques ~ fr F~ DF Vale~ fr L~ 30 1978 18 14 1,252 13.9
    #> 3 2 Ignazio ~ it I~ DF,MF Tori~ it S~ 21 1986 25 21 1,913 21.3
    #> 4 3 Christia~ it I~ GK Milan it S~ 31 1977 28 28 2,441 27.1
    #> 5 4 Pato Abb~ ar A~ GK Geta~ es L~ 35 1972 13 13 1,083 12.0
    #> 6 5 Elvis Ab~ it I~ FW Tori~ it S~ 27 1981 10 2 388 4.3
    #> 7 6 Djamel A~ dz A~ MF Nant~ fr L~ 22 1986 22 12 1,139 12.7
    #> 8 7 Damià Ab~ es E~ DF,MF Betis es L~ 26 1982 25 20 1,788 19.9
    #> 9 8 Éric Abi~ fr F~ DF Barc~ es L~ 28 1979 25 25 2,116 23.5
    #> 10 9 Fabrice ~ fr F~ MF Lori~ fr L~ 29 1979 35 35 3,060 34.0
    #> # ... with 2,786 more rows, 17 more variables: Performance <chr>, Performance <chr>,
    #> # Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
    #> # Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
    #> # Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
    #> # `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
    #> # and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
    #> # 3: `Playing Time`, 4: `Playing Time`
    #> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
    #>
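One thing the printed output suggests: the real header (Rk, Player, ...) sits in row 1 of each tibble, while the column names are blank or repeated group labels such as `Playing Time`. Before combining seasons it may help to promote that first row to column names. The following is only a minimal base R sketch on a made-up toy data frame; `promote_header` is a hypothetical helper name, not part of rvest, and the real tables may need further cleanup.

```r
# Toy stand-in for one scraped table (values are made up): the grouped
# header ended up as column names and the real header is in row 1.
raw <- data.frame(
  a = c("Rk", "1", "2"),
  b = c("Player", "A", "B"),
  c = c("MP", "30", "36"),
  stringsAsFactors = FALSE
)
names(raw) <- c("", "", "Playing Time")

# Hypothetical helper: take the first value of every column as its name,
# make the names unique, then drop that first row.
promote_header <- function(df) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  header <- unname(vapply(df, function(col) as.character(col[1]), character(1)))
  names(df) <- make.unique(header)
  df[-1, , drop = FALSE]
}

clean <- promote_header(raw)
print(clean)
```

`make.unique()` is used because stat names can repeat across header groups; repeated names get a numeric suffix rather than colliding.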

Thanks! Still not quite getting it over the line, I think because I need to bind them correctly:

    > fbref_stats <- lapply(urls, function(url) {
    +   read_html(url) |>
    +     html_nodes("table") |>
    +     html_table()
    + })
    > write.csv(fbref_stats, file = "fbreftest4.csv")
    Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
      arguments imply differing number of rows: 2687, 2770, 2796, 2813, 2818, 2844, 2867, 2861, 2799, 2882, 2840, 2763, 2842, 2935, 3038

I'm trying bind_rows(fbref_stats, .id = "year2") and variants of that, but can't get it quite right.

@BobH, assuming you are interested in one table per page: because html_nodes("table") returns a list (even if there's just one element), html_table() also returns a list, so your fbref_stats list includes one extra level. If you change html_nodes to html_node, or use some other means to get just a single tibble per lapply iteration, you should be able to use bind_rows(fbref_stats) or do.call(rbind, fbref_stats).
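To make the extra list level concrete, here is a small offline sketch. It does not hit fbref.com: the two pages are built in memory with rvest's `minimal_html()`, and the player data, season labels, and output file name are all made up for illustration. It shows the nested result from `html_nodes()`, the single tibble from `html_node()`, and one way to get a season column by naming the list before calling `bind_rows()`.

```r
library(rvest)
library(dplyr)

# Two tiny in-memory pages standing in for two season pages (made-up data).
pages <- list(
  "2006-2007" = minimal_html(
    "<table><tr><th>Player</th><th>Gls</th></tr>
     <tr><td>A</td><td>3</td></tr></table>"
  ),
  "2007-2008" = minimal_html(
    "<table><tr><th>Player</th><th>Gls</th></tr>
     <tr><td>B</td><td>5</td></tr>
     <tr><td>C</td><td>1</td></tr></table>"
  )
)

# html_nodes() matches every table, so html_table() returns a list of
# tibbles for each page: this is the extra level of nesting.
nested <- lapply(pages, function(page) {
  page |> html_nodes("table") |> html_table()
})

# html_node() matches only the first table, so html_table() returns a
# single tibble for each page.
flat <- lapply(pages, function(page) {
  page |> html_node("table") |> html_table()
})

# The list names become the values of the "season" column.
all_seasons <- bind_rows(flat, .id = "season")

# Everything ends up in one csv file (written to a temp dir here).
write.csv(all_seasons, file.path(tempdir(), "fbref_all.csv"), row.names = FALSE)
```

With the real URLs, the same pattern would presumably be to name the `urls` vector by season (for example with `setNames(urls, paste0(year1, "-", year2))`) and call `read_html()` inside the `lapply()`. Note that the real tables, as printed in the answer, carry their header in the first row and have repeated column names, so they would likely need cleaning before `bind_rows()` behaves as hoped.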
