Recursive Web Scraping with Python Beautiful Soup

Question

I wrote a short program that lets a user specify a starting page in the Discogs Wiki Style Guide, scrapes the other styles listed on that page, and then outputs a graph (represented here as a dictionary of sets) of the relationships between subgenres.

I'm looking for guidance/critique on: (1) How to clean up the request_page function. I think there is a more elegant way of both getting the href attrs and filtering to only those with "/style/". (2) The general structure of the program. I'm self-taught and a relative beginner, so I would highly appreciate it if anyone could point out general irregularities.

    import re
    import requests
    from bs4 import BeautifulSoup

    def get_related_styles(start):

        def request_page(start):
            response = requests.get('{0}{1}'.format(base_style_url, start))
            soup = BeautifulSoup(response.content,'lxml')
            ## these lines feel inelegant. considered solutions with
            ## soup.findAll('a', attrs = {'href': pattern.match})
            urls = [anchor.get('href') for anchor in soup.findAll('a')]
            pattern = re.compile('/style/[a-zA-Z0-9\-]*[^/]') # can use lookback regex w/ escape chars?
            style_urls = {pattern.match(url).group().replace('/style/','') for url in urls if pattern.match(url)}
            return style_urls

        def connect_styles(start , style_2):
            ## Nodes should not connect to self
            ## Note that styles are directed - e.g. (A ==> B) =/=> (B ==> A)
            if start != style_2:
                if start not in all_styles.keys():
                    all_styles[start] = {style_2}
                else:
                    all_styles[start].add(style_2)
            if style_2 not in do_not_visit:
                do_not_visit.add(style_2)
                get_related_styles(style_2)

        style_urls = request_page(start)
        for new_style in style_urls:
            connect_styles(start,new_style)

Example Use:

    start = 'Avant-garde-Jazz'
    base_style_url = 'https://reference.discogslabs.com/style/'
    all_styles = {}
    do_not_visit = {start}
    get_related_styles(start)
    print(all_styles)

    {'Free-Jazz': {'Free-Improvisation', 'Free-Funk'}, 'Free-Improvisation': {'Free-Jazz', 'Avant-garde-Jazz'}, 'Avant-garde-Jazz': {'Free-Jazz'}, 'Free-Funk': {'Free-Jazz'}}

alecxe · Accepted Answer · 2018-01-04 04:30:53Z

There is a simpler way to filter out the "style" links - using a CSS selector with a partial match on the href attribute:

    style_urls = {anchor['href'].replace('/style/', '') for anchor in soup.select('a[href^="/style/"]')}

where ^= means "starts with".
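To see the selector in isolation, here is a minimal sketch run against a small hand-written HTML snippet rather than the live site. The markup and the helper name extract_styles are made up for illustration, and html.parser is used instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a style page; not taken from the real site.
SAMPLE_HTML = """
<ul>
  <li><a href="/style/Free-Jazz">Free Jazz</a></li>
  <li><a href="/style/Free-Funk">Free Funk</a></li>
  <li><a href="/genre/Jazz">Jazz</a></li>
  <li><a href="https://example.com/style/Elsewhere">External</a></li>
  <li><a name="no-href">Anchor without href</a></li>
</ul>
"""


def extract_styles(html):
    """Return the set of style names linked from html (illustrative helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # a[href^="/style/"] keeps only anchors whose href starts with /style/,
    # so anchors without an href never reach the anchor['href'] lookup.
    return {anchor['href'].replace('/style/', '')
            for anchor in soup.select('a[href^="/style/"]')}


styles = extract_styles(SAMPLE_HTML)
print(sorted(styles))  # ['Free-Funk', 'Free-Jazz']
```

Because the selector only matches anchors that have an href, there is no need for a separate None check before indexing.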

Here we, of course, lose the check we had on the style name part of the href. If this check is really needed, we can also use a regular expression to match the desired style links directly:

    # the trailing [^/] sits inside the group so the last character of the name is captured
    pattern = re.compile(r'/style/([a-zA-Z0-9\-]*[^/])')
    style_urls = {pattern.search(anchor['href']).group(1)
                  for anchor in soup('a', href=pattern)}

soup() here is a short way of doing soup.find_all().
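The regex-based filter can be tried the same way on a small inline snippet. This is only a sketch: the markup, the helper name extract_styles, and the stricter anchored pattern STYLE_PATTERN are assumptions chosen for illustration, not the pattern from the question.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a style page; not taken from the real site.
SAMPLE_HTML = """
<a href="/style/Free-Jazz">Free Jazz</a>
<a href="/style/Free-Improvisation">Free Improvisation</a>
<a href="/style/">Index</a>
<a href="/genre/Jazz">Jazz</a>
"""

# Illustrative pattern (an assumption): anchored at both ends, with the whole
# style name inside the capturing group.
STYLE_PATTERN = re.compile(r'^/style/([a-zA-Z0-9-]+)$')


def extract_styles(html, pattern=STYLE_PATTERN):
    """Return the style names whose href matches pattern (illustrative helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # Calling soup(...) is shorthand for soup.find_all(...); passing a compiled
    # regex as href keeps only anchors whose href the pattern matches.
    return {pattern.search(anchor['href']).group(1)
            for anchor in soup('a', href=pattern)}


styles = extract_styles(SAMPLE_HTML)
print(sorted(styles))  # ['Free-Improvisation', 'Free-Jazz']
```

Keeping the name inside the capturing group means group(1) already returns the bare style name, so no .replace('/style/', '') step is needed.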
