I'm working on a project that parses data from a lot of websites. Most of my code is done, so I want to use asyncio to eliminate the I/O waiting, but I also wanted to test whether threading would perform better or worse. To do that, I wrote some simple code that makes requests to 100 websites. By the way, I'm using the requests_html library for that; fortunately, it supports asynchronous requests as well.
The asyncio code looks like this:

    import requests
    import time
    from requests_html import AsyncHTMLSession

    aio_session = AsyncHTMLSession()

    urls = [...]  # 100 urls

    async def fetch(url):
        try:
            response = await aio_session.get(url, timeout=5)
            status = 200
        except requests.exceptions.ConnectionError:
            status = 404
        except requests.exceptions.ReadTimeout:
            status = 408

        if status == 200:
            return {
                'url': url,
                'status': status,
                'html': response.html
            }

        return {
            'url': url,
            'status': status
        }

    def extract_html(urls):
        tasks = []
        for url in urls:
            tasks.append(lambda url=url: fetch(url))
        websites = aio_session.run(*tasks)
        return websites

    if __name__ == "__main__":
        start_time = time.time()
        websites = extract_html(urls)
        print(time.time() - start_time)

Execution time (multiple tests):

    13.466366291046143
    14.279950618743896
    12.980706453323364

But if I run an example with threading:

    from queue import Queue
    import requests
    from requests_html import HTMLSession
    from threading import Thread
    import time

    num_fetch_threads = 50
    enclosure_queue = Queue()

    html_session = HTMLSession()

    urls = [...]  # 100 urls

    def fetch(i, q):
        while True:
            url = q.get()
            try:
                response = html_session.get(url, timeout=5)
                status = 200
            except requests.exceptions.ConnectionError:
                status = 404
            except requests.exceptions.ReadTimeout:
                status = 408
            q.task_done()

    if __name__ == "__main__":
        for i in range(num_fetch_threads):
            worker = Thread(target=fetch, args=(i, enclosure_queue,))
            worker.setDaemon(True)
            worker.start()

        start_time = time.time()

        for url in urls:
            enclosure_queue.put(url)

        enclosure_queue.join()

        print(time.time() - start_time)

Execution time (multiple tests):

    7.476433515548706
    6.786043643951416
    6.717151403427124

What I don't understand is this: both libraries are meant for I/O-bound problems, so why are threads faster? The more I increase the number of threads, the more resources the script uses, but the faster it runs. Can someone please explain why threads are faster than asyncio in my example?
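Since the thread count clearly affects the timings, a network-free check of that effect may help when comparing the two scripts. The following is only a minimal sketch under stated assumptions: `fake_get` is a hypothetical stand-in for `session.get` that blocks for 0.05 s, `timed_run` is a hypothetical helper, and the URLs are placeholders. It does not use requests_html at all; it only shows how elapsed time for 100 blocking calls depends on how many workers run them.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_get(url, delay=0.05):
    """Hypothetical stand-in for session.get(): blocks like a slow network call."""
    time.sleep(delay)
    return {'url': url, 'status': 200}


def timed_run(urls, workers):
    """Run fake_get over all urls with `workers` threads; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_get, urls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder urls, assumed only for illustration.
    urls = [f"https://example.com/{i}" for i in range(100)]
    for workers in (10, 50):
        _, elapsed = timed_run(urls, workers)
        print(workers, "workers:", round(elapsed, 2), "s")
```

If the two real scripts end up running with different effective concurrency, a gap like the one this sketch produces would be expected regardless of which library drives the requests, so it seems worth pinning both to the same worker count before comparing.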
Thanks in advance.