
I am currently looping through URLs and grabbing data while visiting/crawling websites.

Sometimes a website takes an unreasonably long time to load. No error is raised, but the page never fully loads, so chromedriver/urlopen never returns and the script just stays in limbo.

Dynamically testing for the presence of an element does not work in this case, because the page won't completely load, and the pages are not similar enough to test for fixed elements (not even abundant tags like html or h1).

Basically, I am looking for code that will continue to the next iteration of the loop after "x" seconds if the page doesn't load.

Currently using Selenium (chromedriver) and BeautifulSoup (bs4).

    import re
    from bs4 import BeautifulSoup

    def get_emails_from_list(links):
        # `driver` is an already-created Selenium WebDriver (chromedriver)
        email = []
        for link in links:
            driver.get(link)
            html = driver.page_source
            try:
                raw = BeautifulSoup(html, 'html.parser').get_text()
                emails = re.findall(r'[\w\.-]+@[\w\.-]+', raw)
            except Exception:
                # fall back to searching the raw HTML if parsing fails
                emails = re.findall(r'[\w\.-]+@[\w\.-]+', str(html))
            for em in emails:
                if em not in email:
                    email.append(em)  # append the address itself, not the whole list
        return email
2
  • What have you tried? People will help, but they won't write code for you. Commented Nov 10, 2016 at 1:07
  • I have been looking for an answer for a while; at this point I am manually restarting and editing the list. I took a look at threading.Timer, which doesn't really apply to this problem so well. I am looking at the signal package, which I didn't know about. It looks promising, but I am wholly unfamiliar with it. Commented Nov 10, 2016 at 1:37

1 Answer


The best/normal way to do this is to set a timeout on the socket, or through the library you are using for network I/O. So you should really consider that first.
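As a minimal sketch of such a library-level timeout, assuming the pages are fetched with `urllib.request.urlopen` (mentioned in the question): its `timeout` argument bounds each blocking socket operation, so a stalled page raises an exception instead of hanging. The helper names `fetch_or_skip` and `crawl` and the 5-second default are illustrative, not from the original post. With Selenium, `driver.set_page_load_timeout(seconds)` should play the same role, after which `driver.get()` raises `selenium.common.exceptions.TimeoutException` on a slow page.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_skip(url, seconds=5):
    """Return the page body as bytes, or None if the request fails or stalls.

    `fetch_or_skip` is a hypothetical helper name. `seconds` bounds each
    blocking socket operation (connect, read), not the total download time.
    """
    try:
        with urllib.request.urlopen(url, timeout=seconds) as response:
            return response.read()
    except (socket.timeout, urllib.error.URLError, OSError):
        # URLError wraps connection-phase failures; socket.timeout covers
        # a server that accepts the connection but never answers.
        return None


def crawl(links, seconds=5):
    """Yield (link, body) for every link that answered in time."""
    for link in links:
        body = fetch_or_skip(link, seconds)
        if body is None:
            continue  # give up on this link and move on to the next one
        yield link, body
```

The timeout applies per socket operation, so a server that trickles data slowly can still take longer than `seconds` in total.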

If that is not an option, threads or signals can be used. This example uses signals.

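    import random
    import signal
    import time

    class TimeoutError(RuntimeError):
        # note: shadows the built-in TimeoutError inside this module
        pass

    def handler(signum, frame):
        raise TimeoutError()

    signal.signal(signal.SIGALRM, handler)

    for i in range(5):
        try:
            signal.alarm(3)                     # deliver SIGALRM in 3 seconds
            time.sleep(random.randint(1, 4))    # stand-in for the slow call
            print('ok', i)
        except TimeoutError as ex:
            print('timeout', i)
        finally:
            signal.alarm(0)                     # cancel any pending alarm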

UPDATE:

Apparently this does not work on Windows, where `SIGALRM` is not available. According to the documentation:

On Windows, `signal()` can only be called with `SIGABRT`, `SIGFPE`, `SIGILL`, `SIGINT`, `SIGSEGV`, or `SIGTERM`.
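
Since signals are not usable for this on Windows, the thread option mentioned earlier is the portable fallback. Here is a minimal sketch using `concurrent.futures` from the standard library; `run_with_timeout` is a hypothetical helper name. One caveat: the timed-out call is not killed. It keeps running in its worker thread, so this only lets the loop move on, and a call that never returns can still delay interpreter exit.

```python
import concurrent.futures
import time


def run_with_timeout(func, seconds, *args):
    """Run func(*args) in a worker thread and wait at most `seconds`.

    Returns (True, result) on success and (False, None) on timeout.
    The worker is NOT cancelled on timeout; it finishes in the background.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return True, future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # wait=False: do not block on a worker that may still be stuck
        pool.shutdown(wait=False)


if __name__ == "__main__":
    # time.sleep stands in for a slow page load
    for i, delay in enumerate([0.1, 2.0, 0.1]):
        ok, _ = run_with_timeout(time.sleep, 1, delay)
        print("ok" if ok else "timeout", i)
```

In the crawler from the question, `func` would be something like `driver.get`, although a driver that is still busy in a background thread may not be safe to reuse immediately.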

2 Comments

Thanks. I need to test this, but what are the args signum and frame in def handler (signum, frame): raise TimeoutError()?
@Avaricious_vulture: They are not used in this example. A signal handler gets passed the signal number (in this case signal.SIGALRM) and the (Python) stack frame. Check out the Python documentation for more.
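
To make the two handler arguments concrete, here is a small Unix-only sketch that records what the handler receives. `describe_alarm` is a hypothetical name, and `signal.setitimer` is used instead of `signal.alarm` only so the demo can use a sub-second delay.

```python
import signal
import time


def describe_alarm(delay=0.1):
    """Arm a one-shot timer and report what the handler received (Unix only)."""
    seen = {}

    def handler(signum, frame):
        # signum: the integer signal number that was delivered.
        # frame: the Python stack frame that was executing at that moment
        # (it may be None).
        seen["name"] = signal.Signals(signum).name
        seen["interrupted_function"] = frame.f_code.co_name if frame else None

    previous = signal.signal(signal.SIGALRM, handler)
    try:
        signal.setitimer(signal.ITIMER_REAL, delay)
        time.sleep(delay * 5)  # the handler runs while this sleep is in progress
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
    return seen


if __name__ == "__main__":
    print(describe_alarm())
```

Because this handler returns normally instead of raising, the interrupted `time.sleep` resumes for the remaining time. Raising an exception from the handler, as in the answer, is what breaks out of the slow call.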
