OOP Web-scraper w/ Python and BeautifulSoup

Question

This is my first major web scraping program in Python. The code works, but I'm not sure whether it's the best OOP design. My code is below:

    from bs4 import BeautifulSoup
    import requests
    import argparse
    import sys


    class ComicScraper():
        # Class ComicScraper for scraping comic books

        def __init__(self, comic_titles, comic_prices, all_comics):
            self.comic_titles = comic_titles
            self.comic_prices = comic_prices
            self.all_comics = all_comics
            # url of comicbook site
            self.url = 'https://leagueofcomicgeeks.com/comics/new-comics/2020/'
            self.webpage = requests.get(self.url)  # HTTP request for url
            # BeautifulSoup object of webpage
            self.soup = BeautifulSoup(self.webpage.content, 'html.parser')
            self.titles = list(
                map(BeautifulSoup.get_text, self.soup.find_all('div', class_='comic-title')))
            self.comicinfo = [x.replace(u'\xa0', u'').strip() for x in list(map(BeautifulSoup.get_text, self.soup.find_all('div', class_='comic-details comic-release'))) ]
            self.prices = [
                prices[-5:] if prices[-5:].startswith('$') else 'No price' for prices in self.comicinfo]

        def main(self):
            if len(sys.argv) == 1:
                print("###### New Comics ######")
                for title, info in zip(self.titles, self.comicinfo):
                    print(title, '--->', info)
            if self.all_comics:
                print("###### New Comics ######")
                for titles, info in zip(self, titles, self.comicinfo):
                    print(title, '--->', info)
            if self.comic_titles and self.comic_prices:
                print("###### New Comics ######")
                for title, prices in zip(self.titles, self.prices):
                    print(title, '--->', info)
            if self.comic_titles:
                for comic_title in self.comic_titles:
                    print(comic_title)
            if self.comic_prices:
                for dol_amount in comic_prices:
                    print(dol_amount)


    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        # Titles of comicbooks i.e "Detective Comics #1"
        parser.add_argument('-t', '--titles', help='Print comic titles ONLY', dest='titles')
        # Scrape prices of comic books in order
        parser.add_argument('-m', '--prices', help='Get comic prices ONLY', dest='prices')
        parser.add_argument('-a', '--all', help='Get titles, prices, publisher, and descriptions', dest='all_comics', action='store_true')
        args = parser.parse_args()
        scraper = ComicScraper(args.titles, args.prices, args.all_comics)
        scraper.main()

I have some doubts about how many instance variables I've used. Would refactoring this code into a set of functions be a better approach?

Reinderien · Accepted Answer · 2020-04-10 02:46:27Z

Bugs

Lines 32 and 42:

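    for titles, info in zip(self, titles, self.comicinfo):
    for dol_amount in comic_prices:

both have unresolved variable references: titles (which should be self.titles; as written, self is also passed to zip as a separate argument) and comic_prices (which should be self.comic_prices). The first loop also binds titles while its body prints title.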
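As a minimal sketch of how those two loops might look once the names are resolved: the class below is a hypothetical stand-in (ComicLister is not part of the original code) that holds already-scraped lists, so it can run without any network access.

```python
class ComicLister:
    """Hypothetical stand-in for the scraper, holding already-scraped data."""

    def __init__(self, titles, comicinfo, comic_prices):
        self.titles = titles
        self.comicinfo = comicinfo
        self.comic_prices = comic_prices

    def print_all(self):
        # zip(self.titles, ...) rather than zip(self, titles, ...), and the
        # loop variable has the same name that print() uses
        for title, info in zip(self.titles, self.comicinfo):
            print(title, '--->', info)

    def print_prices(self):
        # comic_prices is an attribute, so it needs the self. prefix
        for dol_amount in self.comic_prices:
            print(dol_amount)
```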

Bare class declaration

class ComicScraper(): # Class ComicScraper for scraping comic books

can be

    class ComicScraper:
        """
        For scraping comic books
        """

Note the more common format for docstrings used.

Doing too much in an `init`

requests.get(self.url)

should probably not be done in an __init__. Constructors are usually best kept to initializing everything the class will need, without "doing" too much.
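One way to read this advice, as a rough sketch rather than the only possible design: the constructor records configuration, and the request happens in a method on first use. The LazyPage class and its fetch parameter are hypothetical names introduced for illustration; in the scraper, fetch could be something like `lambda url: requests.get(url).text`.

```python
from typing import Callable, Optional


class LazyPage:
    """Hypothetical example: the constructor only records configuration."""

    def __init__(self, url: str, fetch: Callable[[str], str]):
        self.url = url
        self._fetch = fetch  # injected so the class itself needs no network
        self._content: Optional[str] = None

    def content(self) -> str:
        # The slow, failure-prone request happens here, on first use,
        # instead of during construction; later calls reuse the result.
        if self._content is None:
            self._content = self._fetch(self.url)
        return self._content
```

Constructing a LazyPage is then cheap and cannot fail because of the network, which also makes the class easier to test with a fake fetch function.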

Argument names

titles doesn't actually accept multiple titles; it only accepts one. Providing the argument twice overwrites the first value. That means that this loop:

 for comic_title in self.comic_titles:

is actually going to loop through each of the characters in the provided string, which is probably not what you want.

Reading the docs - https://docs.python.org/3/library/argparse.html - you probably want action='append'.
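A small sketch of what action='append' does. The argument list is passed explicitly only to keep the example self-contained, and the second title is a made-up sample value.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # action='append' collects every occurrence of -t into a list
    parser.add_argument('-t', '--titles', action='append',
                        help='Print these comic titles ONLY (repeatable)')
    return parser


args = build_parser().parse_args(['-t', 'Detective Comics #1', '-t', 'Batman #92'])
for comic_title in args.titles:
    print(comic_title)  # one whole title per iteration, not one character
```

If -t is never given, args.titles is None rather than an empty list, so the calling code should handle that case.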

Reaching past `argparse`

This:

 if len(sys.argv) == 1:

should not be done. Instead, rely on the output of argparse.
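As a sketch of relying on the parsed result instead: the "no filter given" case can be derived from the values argparse returns. The wants_everything helper is a hypothetical name, and only two of the original options are reproduced here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument('-t', '--titles', action='append')
    parser.add_argument('-a', '--all', dest='all_comics', action='store_true')
    return parser


def wants_everything(args: argparse.Namespace) -> bool:
    # Hypothetical helper: true when --all was given or no title filter
    # was supplied; sys.argv is never inspected directly.
    return args.all_comics or not args.titles
```

This keeps the decision in one place and still works if the arguments come from somewhere other than the command line, such as a test.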

Suggested

Here is a re-thought program:

    from bs4 import BeautifulSoup, Tag
    from datetime import date, datetime
    from typing import Iterable
    import argparse
    import re

    from requests import Session


    class Comic:
        # · Apr 8th, 2020 · $7.99
        RELEASE_PAT = re.compile(
            r'^\s*·\s*'
            r'(?P<month>\S+)\s*'
            r'(?P<day>\d+)\w*?,\s*'
            r'(?P<year>\d+)\s*'
            r'(·\s*\$(?P<price>[0-9.]+))?\s*$'
        )

        def __init__(self, item: Tag):
            self.id = int(item['id'].split('-')[1])
            sku = item.select_one('.comic-diamond-sku')
            if sku:
                self.sku: str = sku.text.strip()
            else:
                self.sku = None

            consensus_head = item.find(name='span', text=re.compile('CONSENSUS:'))
            if consensus_head:
                self.consensus = float(consensus_head.find_next_sibling().strong.text)
            else:
                self.consensus = None

            potw_head = item.find(name='span', text=re.compile('POTW'))
            self.pick_of_the_week = float(potw_head.find_next_sibling().text.rstrip('%'))

            title_anchor = item.select_one('.comic-title > a')
            self.title: str = title_anchor.text
            self.link = title_anchor['href']

            details = item.select_one('.comic-details')
            self.publisher: str = details.strong.text
            parts = self.RELEASE_PAT.match(list(details.strings)[2]).groupdict()
            self.pub_date: date = (
                datetime.strptime(
                    f'{parts["year"]}-{parts["month"]}-{parts["day"]}',
                    '%Y-%b-%d'
                )
                .date()
            )
            price = parts.get('price')
            if price is None:
                self.price = price
            else:
                self.price = float(price)

            self.desc: str = list(item.select_one('.comic-description > p').strings)[0]


    class ComicScraper:
        URL = 'https://leagueofcomicgeeks.com/'

        def __init__(self):
            self.session = Session()

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            self.session.close()

        @staticmethod
        def _parse(content: str) -> Iterable[Comic]:
            soup = BeautifulSoup(content, 'html.parser')
            list_items = soup.select('#comic-list > ul > li')
            return (Comic(li) for li in list_items)

        def get_from_page(self) -> Iterable[Comic]:
            with self.session.get(self.URL + 'comics/new-comics') as response:
                response.raise_for_status()
                return self._parse(response.content)

        def get_from_xhr(self, req_date: date) -> Iterable[Comic]:
            params = {
                'addons': 1,
                'list': 'releases',
                'list_option': '',
                'list_refinement': '',
                'date_type': 'week',
                'date': f'{req_date:%d/%m/%Y}',
                'date_end': '',
                'series_id': '',
                'user_id': 0,
                'title': '',
                'view': 'list',
                'format[]': (1, 6),
                'character': '',
                'order': 'pulls',
            }
            with self.session.get(self.URL + 'comic/get_comics', params=params) as response:
                response.raise_for_status()
                return self._parse(response.json()['list'])


    def print_comics(comics: Iterable[Comic]):
        print(f'{"Title":40} {"Publisher":20} {"Date":10} {"Price":6}')
        for c in comics:
            print(
                f'{c.title[:40]:40} {c.publisher[:20]:20} '
                f'{c.pub_date}', end=' '
            )
            if c.price is not None:
                print(f' ${c.price:5.2f}', end='')
            print()


    def main():
        parser = argparse.ArgumentParser()
        # Titles of comicbooks i.e "Detective Comics #1"
        parser.add_argument('-t', '--titles', help='Print these comic titles ONLY', action='append')
        args = parser.parse_args()
        titles = args.titles and set(args.titles)

        with ComicScraper() as scraper:
            comics = scraper.get_from_xhr(date(year=2020, month=3, day=25))
            if titles:
                comics = (c for c in comics if c.title in titles)
            print_comics(comics)


    if __name__ == '__main__':
        main()

Points:

Your original class is not useful as a class - it's effectively a function; the important stuff to capture in a class is distinct fields on the data you're trying to represent
You can still class-ify the generator as a class method
Use type hints
Use set membership check for titles
Use a regex to parse the info field
Pull out price and date as a float and date, respectively; don't leave them stringly-typed
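The regex and typed-field points can be tried in isolation, without fetching any page. The sketch below reuses the release pattern from the suggested program on a sample info string; parse_release is a hypothetical helper name, and the '%b' month parsing assumes an English (or C) locale.

```python
import re
from datetime import date, datetime
from typing import Optional, Tuple

# Same pattern as in the suggested program; sample input: · Apr 8th, 2020 · $7.99
RELEASE_PAT = re.compile(
    r'^\s*·\s*'
    r'(?P<month>\S+)\s*'
    r'(?P<day>\d+)\w*?,\s*'
    r'(?P<year>\d+)\s*'
    r'(·\s*\$(?P<price>[0-9.]+))?\s*$'
)


def parse_release(text: str) -> Tuple[date, Optional[float]]:
    """Hypothetical helper: split an info string into a date and a price."""
    parts = RELEASE_PAT.match(text).groupdict()
    pub_date = datetime.strptime(
        f'{parts["year"]}-{parts["month"]}-{parts["day"]}', '%Y-%b-%d'
    ).date()
    price = parts['price']  # None when the optional price group is absent
    return pub_date, (None if price is None else float(price))


print(parse_release(' · Apr 8th, 2020 · $7.99'))
```

Once the values are a date and a float, sorting by release date or summing prices needs no further string handling.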

Note the second method to use an XHR backend instead of the web front-end. The return format is awkward - they return rendered HTML as a part of the JSON payload - but the interface is more powerful and the method might be more efficient. I have not done a lot of investigation into what each of those parameters means; to learn more you will probably have to dig around the site using developer tools.

Thanks for the answer, it’s much appreciated! I just have a few questions: What does “Iterable” do? Which is better, parsing with re or BeautifulSoup?
– Practical1, Commented Apr 10, 2020 at 1:19
“What does ‘Iterable’ do?” It's a type hint that gives a promise for the type to be - literally - an iterable of a certain element type. It does not make promises about the ability to call len, or mutability, or durability.
– Reinderien, Commented Apr 10, 2020 at 1:24
“Which is better, parsing with re or BeautifulSoup?” Generally BS if you can, but sometimes that's not enough and semantic information is mixed into a single string, in which case regexes can help.
– Reinderien, Commented Apr 10, 2020 at 1:25
Edited for narrower, better-structured use of BS tree navigation.
– Reinderien, Commented Apr 10, 2020 at 1:42
So, “Iterable” is just a sanity check to make sure that the argument passed is an iterable, am I right?
– Practical1, Commented Apr 10, 2020 at 6:18
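A small sketch of the Iterable hint discussed in these comments. Python does not check type hints at runtime; they document intent and can be verified by static checkers such as mypy. The function and sample values here are made up for illustration.

```python
from typing import Iterable


def total_length(words: Iterable[str]) -> int:
    # Only iteration is assumed; len(words) or words[0] would go beyond
    # what the hint promises
    return sum(len(word) for word in words)


as_list = ['Batman', 'Saga']
as_generator = (word for word in as_list)

print(total_length(as_list))       # a list is iterable
print(total_length(as_generator))  # so is a generator
print(list(as_generator))          # empty: the generator was consumed above
```

The last line illustrates the "durability" caveat: an Iterable may only be usable once, which is why the suggested program consumes each generator of comics a single time.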

Kate · Accepted Answer · 2020-04-10 02:40:01Z

Just one small contribution from me: I think your utilization of BeautifulSoup is not optimal. For example this bit of code is wasteful as it does not warrant using the map function:

self.titles = list( map(BeautifulSoup.get_text, self.soup.find_all('div', class_='comic-title')))

What does the map function do? From the documentation (emphasis is mine):

map(function, iterable, ...)

Return an iterator that applies function to every item of iterable, yielding the results. If additional iterable arguments are passed, function must take that many arguments and is applied to the items from all iterables in parallel. With multiple iterables, the iterator stops when the shortest iterable is exhausted...

A more straightforward way of getting the same result (and trimming text) would be:

self.titles = [title.get_text().strip() for title in self.soup.find_all('div', class_='comic-title')]

Or:

self.titles = [title.get_text(strip=True) for title in self.soup.find_all('div', class_='comic-title')]

And there is no need to involve BeautifulSoup.get_text either. You've already loaded the soup, once is enough.

Another thing:

self.comicinfo = [x.replace(u'\xa0', u'').strip() for x in list(map(BeautifulSoup.get_text, self.soup.find_all('div', class_='comic-details comic-release'))) ]

Here you are trying to get rid of the non-breaking space. Although we are dealing with just one pesky character, you might encounter more unwanted 'characters' in the future when scraping UTF-8 encoded pages.

Based on several posts on the subject, a possible strategy is to use the unicodedata.normalize function to derive canonical representations of those strings. Since the closest representation of a non-breaking space is of course a plain space, a plain space is what we get.

In short this will give a cleaned-up string that is more usable:

    unicodedata.normalize("NFKD", 'Archie Comics·\xa0 Apr 8th, 2020 \xa0·\xa0 $7.99')
    # output: 'Archie Comics·  Apr 8th, 2020  ·  $7.99'

(and there using the map function makes sense I think)

The cost is importing one more module (import unicodedata, which is part of the standard library). Admittedly this is not so easy to grasp, and even experienced developers get headaches from processing Unicode text and character set conversions. But you can't really avoid those issues when doing scraping jobs; they will always torment you.
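A minimal sketch combining normalization with whitespace collapsing. NFKD turns each non-breaking space into a plain space, which can leave runs of adjacent spaces, so the hypothetical clean helper below also squeezes them down to one.

```python
import unicodedata


def clean(text: str) -> str:
    """Hypothetical helper: normalize, then collapse runs of whitespace."""
    normalized = unicodedata.normalize('NFKD', text)
    # split() with no arguments splits on any whitespace run and drops
    # leading and trailing whitespace
    return ' '.join(normalized.split())


raw = 'Archie Comics·\xa0 Apr 8th, 2020 \xa0·\xa0 $7.99'
print(clean(raw))  # Archie Comics· Apr 8th, 2020 · $7.99
```

Applied to the scraped strings, this could be used as `[clean(x) for x in ...]` or with map(clean, ...).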

One more reference on the topic: Unicode equivalence
