I have a Python script that uses BS4 to parse the HTML of a webpage. I then locate a specific header element in the HTML and extract its text, like this:

    r = br.open("http://example.com")
    html = r.read()
    r.close()
    soup = BeautifulSoup(html)

    # Get the contents of the html tag (h1) that displays results
    searchResult = soup.find("h1").contents[0]

    # Get only the number, remove all text
    if not(searchResult == None):
        searchResultNum = int(re.match(r'\d+', searchResult).group())
    else:
        searchResultNum = 696969

The actual HTML code doesn't change. It always looks like this:

    <div id="resultsCount">
        <h1 class="f12">606 Results matched</h1>
    </div>

The problem is, my script runs fine for maybe 4 minutes (varies) and crashes with:

    Traceback (most recent call last):
      File "C:\Users\Me\Documents\Aptana Studio 3 Workspace\PythonScripts\PythonScripts\setupscript.py", line 109, in <module>
        searchResultNum = int(re.match(r'\d+', searchResult).group())
    AttributeError: 'NoneType' object has no attribute 'group'

I thought I was handling this error. I guess I just do not understand it. Can you help?

Thanks.

1 Answer

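If searchResult does not start with a number, re.match(r'\d+', searchResult) returns None, and None has no group attribute. That is where the AttributeError comes from: your check only guards against searchResult being None, not against the match failing. Also, if not(searchResult == None): is unidiomatic; use if searchResult: instead: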

    searchResultNum = 696969
    if searchResult:
        m = re.match(r'\d+', searchResult)
        if m:
            searchResultNum = int(m.group())
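
For something you can run without the scraping part, here is a minimal sketch of the same guard wrapped in a function. The name parse_result_count and the DEFAULT_COUNT constant are hypothetical, made up for illustration; the fallback value 696969 is simply the one from the question, and the input strings are stand-ins for whatever the h1 happens to contain.

```python
import re

# Fallback value taken from the question's code (assumption: you want to keep it).
DEFAULT_COUNT = 696969


def parse_result_count(text, default=DEFAULT_COUNT):
    """Return the integer at the start of text, or default if there is none."""
    # Covers both None and an empty string.
    if not text:
        return default
    m = re.match(r'\d+', text)
    # re.match returns None when text does not begin with a digit.
    if m is None:
        return default
    return int(m.group())


print(parse_result_count("606 Results matched"))  # 606
print(parse_result_count("No results matched"))   # 696969
print(parse_result_count(None))                   # 696969
```

Since the crash only shows up after a few minutes, the page probably returns a heading without a leading number now and then; logging the offending text before falling back to the default would tell you what it actually was.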

2 Comments

...the consequence being that he probably should be using re.search() instead of re.match()...
Unless he only wants numbers at the beginning; his example text "606 Results matched" kind of implies that he does.
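
The difference these comments point at can be seen in a short sketch. The second string is an invented example of a heading where the number is not at the start; only the first string comes from the question.

```python
import re

at_start = "606 Results matched"    # text from the question
in_middle = "Results matched: 606"  # invented variant, number not at the start

# re.match only tries to match at the beginning of the string.
print(re.match(r'\d+', at_start).group())    # 606
print(re.match(r'\d+', in_middle))           # None

# re.search scans the whole string for the first match.
print(re.search(r'\d+', in_middle).group())  # 606
```

Either way, both functions return None when the text contains no digits at all, so the check on the match object is still needed.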
