
I've written a simple scraper that parses HTML using BeautifulSoup and collects the data (schedule of sports events), then clubs them together in a list of dicts.

The code works just fine, but the way I process the data is pretty horrible, IMO. I use an if...else to parse the data selectively, because the rows alternate between two kinds of content: one row holds the venue (and the result, if available), and the other holds the match time and team information.

So basically, is there a better, more reliable way to parse the data?

import requests
import re
from BeautifulSoup import BeautifulSoup
from datetime import datetime


class cricket(object):
    def getMatches(self, url):
        """ Scrape the given url for match schedule """
        headers = {'Accept': 'text/css,*/*;q=0.1',
                   'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                   'Accept-Encoding': 'gzip,deflate,sdch',
                   'Accept-Language': 'en-US,en;q=0.8',
                   'User-Agent': 'Mozilla/5 (Solaris 10) Gecko'}
        page = requests.get(url, headers=headers)
        page_content = page.content
        soup = BeautifulSoup(page_content)
        result = soup.find('div', attrs={'class': 'bElementBox'})
        tags = result.findChildren('tr')
        match_type_list = ['TEST', 'ODI', 'T20']
        match_info = []
        for elem in range(1, len(tags)):
            dict_ = {}
            x = tags[elem].getText()
            x = x.replace(r' ', '')
            if 'Venue' in x:
                for a in match_type_list:
                    if a in x:
                        match_type = a
                x = x.replace('Venue', '')
                if 'Result' in x:
                    x = x.replace('Result', '')
                    x = x.split(': ')
                    # print x
                    venue = x[1]
                    result = x[2]
                    dict_.update({'venue': venue, 'result': result, 'match_type': match_type})
                else:
                    x = x.split(': ')
                    venue = x[1]
                    dict_.update({'venue': venue})
            else:
                match = re.search(r'\b[AP]M', x)
                date_time = x[0:match.end()]
                date_time = date_time.replace(',', '')[4:]
                teams = x[match.end():].split('vs')
                home_team = teams[0].strip()
                away_team = teams[1].strip()
                # print date_time, home_team, away_team
                time_obj = datetime.strptime(date_time, '%b %d %Y %I:%M %p')
                timings = time_obj.strftime('%Y-%m-%dT%H:%MZ')
                dict_.update({'home_team': home_team, 'away_team': away_team, 'timings': timings})
            match_info.append(dict_)
        # print match_info
        final_list = []  # final list of dicts that we need
        for i in range(0, len(match_info), 2):
            final_list.append(dict(match_info[i].items() + match_info[i+1].items()))
        # for i in final_list:
        #     print i
        # print final_list


if __name__ == '__main__':
    url = 'http://icc-cricket.yahoo.net/match_zone/series/fixtures.php?seriesCode=ENG_WI_2012'
    # change seriesCode in URL for different series.
    #url = 'http://localhost:6543/lhost/static/icc_cricket.html'
    c = cricket()
    c.getMatches(url)

2 Answers

import requests
import re
from BeautifulSoup import BeautifulSoup
from datetime import datetime

class cricket(object):

Python convention states that classes should be named in CamelCase.

 def getMatches(self, url): 

Python convention states that methods should be lowercase_with_underscores. Also, there isn't really a reason to have this method in a class anyway. Seems to me that it should be a function.

 """ Scrape the given url for match schedule """ headers = {'Accept':'text/css,*/*;q=0.1', 'Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.3', 'Accept-Encoding':'gzip,deflate,sdch', 'Accept-Language':'en-US,en;q=0.8', 'User-Agent':'Mozilla/5 (Solaris 10) Gecko'} 

I'd move this out of the function into a global constant.

        page = requests.get(url, headers=headers)
        page_content = page.content
        soup = BeautifulSoup(page_content)

I'd combine these three lines

soup = BeautifulSoup(requests.get(url, headers=headers).content)

result = soup.find('div', attrs={'class': 'bElementBox'})
tags = result.findChildren('tr')

I'd avoid non-descriptive names like result. tags is a bit better, but not by a whole lot. I'd also combine these two lines:

tags = soup.find('div', attrs={'class': 'bElementBox'}).findChildren('tr')

match_type_list = ['TEST', 'ODI', 'T20']
match_info = []
for elem in range(1, len(tags)):

It'd make more sense to process things two rows at a time.
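For example, slicing with a step of two pairs each schedule row with its details row (the sample strings below are invented for illustration, not taken from the real page):

```python
# Hypothetical sketch: walk a flat list of alternating rows two at a time,
# so each schedule row travels with its matching venue/result row.
rows = ['Mon 7:00 PM A vs B', 'Venue: Lords',
        'Tue 2:30 PM C vs D', 'Venue: Oval Result: C won']

for schedule, details in zip(rows[0::2], rows[1::2]):
    print(schedule, '->', details)
```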

 dict_ = {} 

Useless name alert.

    x = tags[elem].getText()
    x = x.replace(r' ', '')

At this point, you extract the text and throw out the HTML. But the data is divided into table cells, so why in the world wouldn't you want to take advantage of that? Instead, you drop down to trying to extract data straight from the text, which is much harder than if you can use the tags as hints.

    if 'Venue' in x:
        for a in match_type_list:
            if a in x:
                match_type = a
        x = x.replace('Venue', '')
        if 'Result' in x:
            x = x.replace('Result', '')
            x = x.split(': ')
            # print x
            venue = x[1]
            result = x[2]

All this work to extract data from the string is something you really should use a regular expression for. This is exactly the kind of situation it excels at.
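For instance, a single pattern (invented here as a sketch, not tested against the real page) can pull out the venue and the optional result in one pass:

```python
import re

# Illustrative only: one regex replaces the chain of replace()/split() calls.
# The venue is everything after 'Venue:' up to an optional 'Result:' section.
pattern = re.compile(r'Venue:\s*(?P<venue>.+?)(?:\s*Result:\s*(?P<result>.+))?$')

m = pattern.search('T20 Venue: Lords Result: England won by 5 wickets')
print(m.group('venue'), '|', m.group('result'))

m = pattern.search('ODI Venue: The Oval')
print(m.group('venue'), '|', m.group('result'))  # result is None here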

 dict_.update({'venue':venue,'result':result, 'match_type':match_type}) 

It's not clear why you would choose to update rather than simply assign. There is no way that any other line of code in this loop can assign to it.

        else:
            x = x.split(': ')
            venue = x[1]
            dict_.update({'venue': venue})
    else:
        match = re.search(r'\b[AP]M', x)
        date_time = x[0:match.end()]
        date_time = date_time.replace(',', '')[4:]
        teams = x[match.end():].split('vs')
        home_team = teams[0].strip()
        away_team = teams[1].strip()
        # print date_time, home_team, away_team
        time_obj = datetime.strptime(date_time, '%b %d %Y %I:%M %p')

The organization seems a little suspect. You jump from working on the date over to the teams, and then back to the date. I'd stick with the date until it was finished. You also spend a bunch of lines massaging the date; however, strptime lets you specify any format you want, so you should just be able to have it parse the data directly.
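Something like this, assuming the page's date strings look like the sample below:

```python
from datetime import datetime

# Sketch: strptime can swallow the weekday and the commas itself, so the
# manual replace(',', '') and slicing steps become unnecessary.
raw = 'Thu, May 17, 2012 10:30 AM'
when = datetime.strptime(raw, '%a, %b %d, %Y %I:%M %p')
print(when.isoformat())  # 2012-05-17T10:30:00
```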

 timings = time_obj.strftime('%Y-%m-%dT%H:%MZ') 

If I'm parsing, I wouldn't convert the time back into a string. I'd keep it as a datetime object.

        dict_.update({'home_team': home_team, 'away_team': away_team, 'timings': timings})
    match_info.append(dict_)
# print match_info
final_list = []  # final list of dicts that we need
for i in range(0, len(match_info), 2):
    final_list.append(dict(match_info[i].items() + match_info[i+1].items()))

I'd do this:

for i in range(0, len(match_info), 2):
    left, right = match_info[i:i+2]
    final_list.append(dict(left.items() + right.items()))

I think it makes things a bit clearer.
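As an aside, `left.items() + right.items()` only works on Python 2; Python 3's `items()` returns a view that can't be concatenated with `+`. A merge that works on both looks like this:

```python
# Merging two dicts portably: pass one dict positionally and unpack the
# other as keyword overrides (keys here are strings, so this is safe).
left = {'venue': 'Lords', 'match_type': 'ODI'}
right = {'home_team': 'England', 'away_team': 'West Indies'}
merged = dict(left, **right)
print(merged)
```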

    # for i in final_list:
    #     print i
    # print final_list

This function probably needs to return that list or something.

if __name__ == '__main__':
    url = 'http://icc-cricket.yahoo.net/match_zone/series/fixtures.php?seriesCode=ENG_WI_2012'
    # change seriesCode in URL for different series.
    #url = 'http://localhost:6543/lhost/static/icc_cricket.html'
    c = cricket()
    c.getMatches(url)

Good

Here is my reworking of your code:

def get_cricket_matches(content):
    """ Scrape the given content for match schedule """
    soup = BeautifulSoup(content)
    all_rows = soup.find('div', attrs={'class': 'bElementBox'}).findChildren('tr')
    match_info = []
    for index in range(1, len(all_rows), 2):
        rows = [row.findChildren('td') for row in all_rows[index:index+2]]
        data = {
            'match_type': rows[1][0].findChildren('b')[1].getText(),
            'home_team': rows[0][2].getText(),
            'away_team': rows[0][5].getText(),
            'match_time': datetime.strptime(rows[0][0].getText(), '%a, %b %d, %Y %I:%M %p'),
        }
        for line in rows[1][1].findAll('b'):
            content = str(line.nextSibling)[1:]
            if line.getText() == 'Venue':
                data['venue'] = content
            else:
                data['result'] = content
        match_info.append(data)
    return match_info

Personally, I don't like dependencies (in this case requests and BeautifulSoup). Why not just use the standard modules?

Here I have to strongly disagree with @cat_baxter. Dependencies are the best part of python. That is, the availability and ease of installing all these different libraries to make it easier to write code is a great asset of Python. You should never be worried about taking on dependencies that are easy to install if it helps.

Having said that, you don't get a whole lot out of the requests module. So an argument can be made that this code would be better off using urllib.
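A standard-library version of the fetch might look like this (a sketch only; note that urlopen moved between modules across Python versions, and the User-Agent string is carried over from the original code):

```python
# Fetch a page using only the standard library instead of requests.
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen         # Python 2

def fetch(url, user_agent='Mozilla/5 (Solaris 10) Gecko'):
    # Build a request with a custom User-Agent and return the raw body.
    req = Request(url, headers={'User-Agent': user_agent})
    return urlopen(req).read()
```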

But then the question is whether it's a good idea to parse HTML with a regex. With scraping you can get away with using a regex because, regardless of the technique you use, changes to the page's structure will probably break it. The question at that point is which version has the simpler, easier-to-write code.

Compare:

re.match('<div class="bElementBox">.+<tbody>(.+).+</tbody>.+</div>', page).group(1) 

and

soup.find('div', attrs={'class':'bElementBox'}) 

The HTML parser method takes advantage of knowing the structure of HTML to make it fairly easy to find the element. Is the regex entirely correct? (Technically no, because a regex simply can't parse HTML correctly, but is it good enough for the job? I assume so, but I have more confidence in the HTML parser.)

columns = re.compile('<td.*?>(.*?)</td>', re.DOTALL | re.M).findall(table) 

An equivalent would be

columns = table.findAll('td')

Again, HTML parser wins hands down.

 'home_team' : columns[index + 2], 

VS:

 'home_team': rows[0][2].getText(), 

The difference between rows[0][2] and columns[index + 2] is the fact that I chose to structure the parser after the rows, whereas cat_baxter decided to throw away the rows and look just at the columns. Either approach can be done either way, so that's incidental to this question. But here the HTML parser method is slightly more complicated because it has to use .getText().

    'venue': re.match("<b>Venue</b>: ([\s,\w\']+)", columns[index + 7]).group(1),

m = re.match('.+Result</b>:([\w, ]+)', columns[index + 7])
if m:
    row['result'] = m.group(1)

Vs

for line in rows[1][1].findAll('b'):
    content = str(line.nextSibling)[1:]
    if line.getText() == 'Venue':
        data['venue'] = content
    else:
        data['result'] = content

I think this one is hard to call. My lines of code have less going on in them, but it might be somewhat more obvious what is happening in the regular expression.

My conclusion is that using an HTML parser for scraping is often much nicer than using a regex, and sometimes a little worse. On balance, I think it's the way better idea.


Personally, I don't like dependencies (in this case requests and BeautifulSoup). Why not just use the standard modules?

import urllib, re

URL = 'http://icc-cricket.yahoo.net/match_zone/series/fixtures.php?seriesCode=ENG_WI_2012'
page = urllib.urlopen(URL).read()
# just for the testing - the url above is unavailable, grab it from google cache
# and write it to page.txt
#page = open("page.txt").read()

# find the table
table = re.match('<div class="bElementBox">.+<tbody>(.+).+</tbody>.+</div>', page).group(1)
# extract all columns
columns = re.compile('<td.*?>(.*?)</td>', re.DOTALL | re.M).findall(table)

result = []
# process them by 8
for index in range(0, len(columns), 8):
    row = {
        'home_team': columns[index + 2],
        'away_team': columns[index + 5],
        'match_type': re.match('.+<b>([A-Z,0-9]+)</b>', columns[index + 6]).group(1),
        'venue': re.match("<b>Venue</b>: ([\s,\w\']+)", columns[index + 7]).group(1),
        'timings': columns[index],
    }
    m = re.match('.+Result</b>:([\w, ]+)', columns[index + 7])
    if m:
        row['result'] = m.group(1)
    result.append(row)

for r in result:
    print(r)
  • Yeah, I was thinking about going back and using urllib, but won't you agree that parsing raw HTML with just regex can be too much? Commented Jul 12, 2012 at 20:05
  • No, you can't parse HTML with regex. Really. stackoverflow.com/a/1732454/282912 Commented Jul 12, 2012 at 23:17
  • It's possible and it's easy :) Commented Jul 13, 2012 at 15:01
  • What do you have against using libraries? From a recent question of mine I was told to avoid using urllib. There's an item in Effective Java stating that you should know and use the libraries; I'm sure this applies to Python as well. Commented May 7, 2014 at 12:05
