
I need cheap GPUs and decided to make a little eBay scraper to make sure I can buy them at low prices. It works, and I can search for any product and scrape its Title, Price and Shipping cost.

But I feel like I haven't done it in a generic way, hence the need to post it here. My question is: could this have been written more generically?


Code

from bs4 import BeautifulSoup
from urllib.request import urlopen


def find_next_page(soup):
    '''
    Finds next page

    Returns:
        Next page link if exists
        None if next page does not exists
    '''
    next_page = soup.find('a', {'class': 'gspr next'})
    try:
        return next_page['href']
    except:
        return None


def scrape_page(soup, idx):
    '''
    Scrape page for products and save them as a dictionary

    Returns:
        A dictionary of products
    '''
    products = {}
    prod_idx = idx
    for ultag in soup.find_all('ul', {'id': 'ListViewInner'}):
        for litag in ultag.find_all('li'):
            title = litag.find('a', {'class': 'vip'})
            if not title is None:
                products[prod_idx] = {}
                title_text = title.text
                if 'New listing' in title_text:
                    title_text = title_text.replace('New listing', '').lstrip()
                title_text.strip()
                products[prod_idx]['Title'] = title_text
                ul = litag.find('ul', {'class': 'lvprices left space-zero'})
                if not ul is None:
                    for li in ul.find_all('li'):
                        if '$' in li.text and not 'shipping' in li.text.lower():
                            products[prod_idx]['Price'] = li.text.split()[0]
                        if 'shipping' in li.text.lower():
                            products[prod_idx]['Shipping'] = li.text.strip()
                prod_idx += 1
    return products, prod_idx


def table_print(products):
    '''
    Prints products in nice human-readable format
    '''
    print("{:<8} {:<100} {:<15} {:<30}".format('Key', 'Product', 'Price', 'Shipping'))
    for k, v in products.items():
        try:
            t, p, s = v
            print('{:<8} {:<100} {:<15} {:<30}'.format(k, products[k][t], products[k][p], products[k][s]))
        except ValueError:
            continue


def scrape_product(ebay_page):
    '''
    Main scraper
    '''
    products = {}
    idx = 0
    while not ebay_page is None:
        html_doc = urlopen(ebay_page)
        soup = BeautifulSoup(html_doc, 'html.parser')
        prod, idx = scrape_page(soup, idx)
        products.update(prod)
        ebay_page = find_next_page(soup)
    return products


def make_search_string(keywords):
    '''
    Make ebay search products string
    '''
    base = 'https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw='
    end = '&_sacat=0'
    return '{0}{1}{2}'.format(base, '+'.join(keywords), end)


if __name__ == '__main__':
    print('Project I Need a Cheap GPU (Ebay-Scraper in Python3) \n@Ludisposed \n')
    keywords = input('What do you want to search ebay for? ').split()

    # Make search string
    ebay_page = make_search_string(keywords)

    # Find all products
    products = scrape_product(ebay_page)

    # Print all products
    table_print(products)

Example

Project I Need a Cheap GPU (Ebay-Scraper in Python3)
@Ludisposed

What do you want to search ebay for? amd 580
Key      Product                                                                              Price      Shipping
0        XFX AMD Radeon RX 580 8GB GDDR5 GTR Black Edition PCI Express 3.0 New Sealed         $305.00    +$37.83 shipping
1        MSI AMD Radeon RX 580 GAMING X 4G GDDR5 DVI/2HDMI/2Displayport PCI-Express Video     $336.99    +$46.19 shipping
etc....

1 Answer


There are multiple things to improve, let's try to categorize them.

Code Style

  • there are a number of PEP8 code style violations; specifically:
  • handling a bare exception with except: is considered bad practice
  • when checking whether a substring is present in a string, use not in, e.g. not 'shipping' in li.text.lower() can be replaced with the more readable:

    'shipping' not in li.text.lower() 
  • I would also improve the way you format the search URL, extracting the base URL as a constant and using str.format() to insert the keywords into it:

    BASE_URL_TEMPLATE = 'https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw={keywords}&_sacat=0'

    def make_search_string(keywords):
        """Constructs a search products string."""
        return BASE_URL_TEMPLATE.format(keywords='+'.join(keywords))
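As a sketch of an alternative (my own suggestion, not part of the original answer), the standard library's urllib.parse.urlencode can assemble the query string and handle escaping of the keywords for you:

```python
# Sketch only: build the query string with urllib.parse.urlencode so that
# special characters in the keywords are percent-escaped automatically.
from urllib.parse import urlencode

BASE_URL = 'https://www.ebay.com/sch/i.html'

def make_search_string(keywords):
    """Construct an eBay search URL from a list of keywords."""
    params = {
        '_from': 'R40',
        '_trksid': 'm570.l1313',
        '_nkw': ' '.join(keywords),  # urlencode turns the spaces into '+'
        '_sacat': '0',
    }
    return '{}?{}'.format(BASE_URL, urlencode(params))

print(make_search_string(['amd', '580']))
# -> https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=amd+580&_sacat=0
```

This also keeps working if a keyword contains characters that are not URL-safe.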

Web-Scraping and HTML Parsing

  • if you switch to requests instead of urllib, you can then significantly improve the performance of making the requests by re-using a session:

    ...if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase.
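A minimal sketch of what that could look like, assuming the question's scrape_page() and find_next_page() helpers are in scope (requests and bs4 are third-party packages):

```python
import requests
from bs4 import BeautifulSoup


def scrape_product(ebay_page):
    '''Main scraper, reusing one TCP connection for all result pages.'''
    products = {}
    idx = 0
    with requests.Session() as session:
        while ebay_page is not None:
            response = session.get(ebay_page)
            soup = BeautifulSoup(response.text, 'html.parser')
            prod, idx = scrape_page(soup, idx)  # helper from the question
            products.update(prod)
            ebay_page = find_next_page(soup)
    return products
```

The with block also guarantees the session is closed when scraping finishes or fails.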

  • using lxml instead of html.parser can provide an HTML-parsing performance boost as well
  • see if a SoupStrainer, which focuses the HTML parser on only the desired part of the HTML tree, is also applicable in your scenario. It can further speed up the HTML parsing step
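For example, a toy sketch combining both ideas (the id matches the one the question's code targets; lxml is an optional third-party parser):

```python
from bs4 import BeautifulSoup, SoupStrainer

# A stand-in for a real eBay results page.
html = ('<html><body><div>menu, ads, footer ...</div>'
        '<ul id="ListViewInner"><li>RX 580 listing</li></ul></body></html>')

# Only the results list is parsed into the tree; the rest of the page is skipped.
only_results = SoupStrainer('ul', id='ListViewInner')
soup = BeautifulSoup(html, 'lxml', parse_only=only_results)

print(soup.find('li').text)  # -> RX 580 listing
```

Note that SoupStrainer works with html.parser and lxml, but not with the html5lib parser.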
  • an LBYL (look before you leap) approach inside find_next_page would be a bit more readable and concise:

    def find_next_page(soup):
        """
        Finds next page.

        Returns:
            Next page link if exists
            None if next page does not exist
        """
        next_page_link = soup.select_one('a.next')
        return next_page_link['href'] if next_page_link else None
  • the scrape_page() function is overly complicated. For instance, there is a single element with id="ListViewInner" on the page, so you don't need a loop to locate it. And you can do better at locating elements on the page:

    def scrape_page(soup, index):
        """
        Scrapes page for products and saves them as a dictionary.

        Returns:
            A dictionary of products
        """
        products = {}
        for product_index, item in enumerate(soup.select('#ResultSetItems li'), index):
            title = item.select_one('a.vip')
            price = item.select_one('.prc')
            shipping_price = item.select_one('.ship')

            products[product_index] = {
                'Title': title.find(text=True, recursive=False) if title else None,
                'Price': price.get_text(strip=True) if price else None,
                'Shipping': shipping_price.get_text(strip=True).split()[0] if shipping_price else None
            }
        return products, product_index

    (not tested)

Also note the use of enumerate() for better handling enumeration (thanks @graipher for the idea).
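As a minimal illustration of the enumerate(iterable, start) form used above, the second argument lets the numbering continue from the index reached on the previous page:

```python
page_one_items = ['GPU A', 'GPU B', 'GPU C']
page_two_items = ['GPU D', 'GPU E']

products = {}
# First page: numbering starts at 0.
for index, item in enumerate(page_one_items, 0):
    products[index] = item
# Second page: numbering continues where the first page stopped.
for index, item in enumerate(page_two_items, index + 1):
    products[index] = item

print(products)
# -> {0: 'GPU A', 1: 'GPU B', 2: 'GPU C', 3: 'GPU D', 4: 'GPU E'}
```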

  • I knew I was overcomplicating the scrape_page(). + I tried it with requests but I failed somewhere, therefore I used something else to make it work... Damn, this means I need to tinker with bs4 some more. – Commented Sep 18, 2017 at 14:33
  • @Graipher oh, good idea, completely missed that, putting too much focus on the HTML parsing part :) Thanks! – Commented Sep 19, 2017 at 15:09
  • And one more thing: now it becomes really clear that products is just a structure which maps sequential integers starting at zero to some object. So it is basically a list and you don't need a dict at all. Just append to a list, or even better, make scrape_page a generator and then extend the list with it outside. – Commented Sep 19, 2017 at 15:12
  • enumerate doesn't work well, because it scrapes empty products. So I have it outside and check if title is not None; if the check passes I add 1, to avoid empty products. Secondly, I like the data structure as a dictionary, to iterate only over titles for instance. With a list it is also possible, but less readable imo. – Commented Sep 19, 2017 at 16:12
