
I'm working on a project that parses data from a lot of websites. Most of my code is done, so I'm looking forward to using asyncio to eliminate the I/O waiting, but I still wanted to test how threading compares, for better or worse. To do that, I wrote some simple code to make requests to 100 websites. By the way, I'm using the requests_html library for this; fortunately, it supports asynchronous requests as well.

The asyncio code looks like this:

import time

import requests
from requests_html import AsyncHTMLSession

aio_session = AsyncHTMLSession()
urls = [...]  # 100 urls

async def fetch(url):
    try:
        response = await aio_session.get(url, timeout=5)
        status = 200
    except requests.exceptions.ConnectionError:
        status = 404
    except requests.exceptions.ReadTimeout:
        status = 408
    if status == 200:
        return {'url': url, 'status': status, 'html': response.html}
    return {'url': url, 'status': status}

def extract_html(urls):
    tasks = []
    for url in urls:
        tasks.append(lambda url=url: fetch(url))
    websites = aio_session.run(*tasks)
    return websites

if __name__ == "__main__":
    start_time = time.time()
    websites = extract_html(urls)
    print(time.time() - start_time)

Execution time (multiple tests):

13.466366291046143
14.279950618743896
12.980706453323364

But if I run an example with threading:

import time
from queue import Queue
from threading import Thread

import requests
from requests_html import HTMLSession

num_fetch_threads = 50
enclosure_queue = Queue()
html_session = HTMLSession()
urls = [...]  # 100 urls

def fetch(i, q):
    while True:
        url = q.get()
        try:
            response = html_session.get(url, timeout=5)
            status = 200
        except requests.exceptions.ConnectionError:
            status = 404
        except requests.exceptions.ReadTimeout:
            status = 408
        q.task_done()

if __name__ == "__main__":
    for i in range(num_fetch_threads):
        worker = Thread(target=fetch, args=(i, enclosure_queue,))
        worker.setDaemon(True)
        worker.start()

    start_time = time.time()
    for url in urls:
        enclosure_queue.put(url)
    enclosure_queue.join()
    print(time.time() - start_time)

Execution time (multiple tests):

7.476433515548706
6.786043643951416
6.717151403427124

The thing I don't understand: both libraries are meant to tackle I/O-bound workloads, so why are the threads faster? The more I increase the number of threads, the more resources it uses, but it's also a lot faster. Can someone please explain why threads are faster than asyncio in my example?

Thanks in advance.

  • The line "websites = extract_html(urls:100])" in the async-io code seems to be messed up. Commented Jun 22, 2020 at 7:38
  • @Roy2012 Fixed, forgot to close the parentheses when pasting the code. Commented Jun 22, 2020 at 7:40

1 Answer


It turns out requests-html uses a pool of threads for running the requests. The default number of threads is the number of cores on the machine multiplied by five. This probably explains the performance difference you noticed.
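For illustration, here is the general shape of that approach: an async front end that hands each request to a thread pool, so every request still runs as blocking code on a worker thread. This is a sketch of the pattern, not requests-html's actual source; the pool size simply mirrors the cpu_count() * 5 figure above, which was the ThreadPoolExecutor default on Python 3.5-3.7.

import asyncio
import os
from concurrent.futures import ThreadPoolExecutor

import requests

# Matches the cpu_count() * 5 default discussed above.
executor = ThreadPoolExecutor(max_workers=os.cpu_count() * 5)

async def fetch(url):
    loop = asyncio.get_running_loop()
    # The event loop only waits for the worker thread to finish; the HTTP
    # request itself is a plain blocking requests.get() on that thread, so
    # concurrency is capped by the pool size, not by the event loop.
    return await loop.run_in_executor(executor, requests.get, url)

With 50 explicit threads, your threading version can run more requests concurrently than this pool allows on a typical 4- or 8-core machine, which would account for the timings you measured.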

You might want to try the experiment again using aiohttp instead. With aiohttp, the underlying socket for the HTTP connection is registered directly in the asyncio event loop, so no threads should be involved.
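For reference, a minimal sketch of the same benchmark rewritten with aiohttp. The URL list, the 5-second timeout, and the error-to-status mapping are assumptions carried over from your code above.

import asyncio
import time

import aiohttp

urls = [...]  # 100 urls

async def fetch(session, url):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as response:
            html = await response.text()
            return {'url': url, 'status': 200, 'html': html}
    except aiohttp.ClientConnectionError:
        return {'url': url, 'status': 404}
    except asyncio.TimeoutError:
        return {'url': url, 'status': 408}

async def extract_html(urls):
    # A single ClientSession reuses connections across all requests.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

if __name__ == "__main__":
    start_time = time.time()
    websites = asyncio.run(extract_html(urls))
    print(time.time() - start_time)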


4 Comments

Question: out of curiosity: the GitHub repo of requests-html doesn't seem to include any actual code. Just tests, doc, and an 'ext' directory with a single file. Where's the actual code?
There's a link to the source code in my answer; the file is called requests_html.py.
@Vincent Thank you, great response, it makes sense. Tonight I'm going to try to write my own function with aiohttp, and I'll reply back with the results.
Came back to say that everything works well with aiohttp, and it's a lot faster. Indeed, the problem was with the requests_html library.
