
I am trying to scrape some data from https://www.officialcharts.com/ by parallelising web requests using asyncio/aiohttp. I implemented the code given at the link here.

I followed two different procedures. The first one goes like this.

from bs4 import BeautifulSoup
from urllib.request import urlopen
from selenium import webdriver
import time
import pandas as pd
import numpy as np
import re
import json
import requests
from bs4 import BeautifulSoup
from datetime import date, timedelta
from IPython.display import clear_output
import memory_profiler
import spotipy
import spotipy.util as util
import pandas as pd
from more_itertools import unique_everseen

weeks = []
d = date(1970, 1, 1)
d += timedelta(days = 6 - d.weekday())
for i in range(2500):
    weeks.append(d.strftime('%Y%m%d'))
    d += timedelta(days = 7)

import asyncio
from aiohttp import ClientSession
import nest_asyncio
nest_asyncio.apply()

result = []

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def run(r):
    tasks = []
    # Fetch all responses within one Client session,
    # keep connection alive for all requests.
    async with ClientSession() as session:
        for i in range(r):
            url = 'https://www.officialcharts.com/charts/singles-chart/' + weeks[i] + '/'
            task = asyncio.ensure_future(fetch(url, session))
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        result.append(responses)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(5))
loop.run_until_complete(future)
print('Done')
print(result[0][0] == None)

The problem with the above code is that it fails when I make more than 1000 simultaneous requests.

The author of the post implemented a different procedure to address this issue, and he claims it can handle as many as 10K requests. I followed his second procedure, and here is my code for that.

import random
import asyncio
from aiohttp import ClientSession
import nest_asyncio
nest_asyncio.apply()

result = []

async def fetch(url, session):
    async with session.get(url) as response:
        delay = response.headers.get("DELAY")
        date = response.headers.get("DATE")
        print("{}:{} with delay {}".format(date, response.url, delay))
        return await response.read()

async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)

async def run(r):
    tasks = []
    # create instance of Semaphore
    sem = asyncio.Semaphore(1000)
    # Create client session that will ensure we dont open new connection
    # per each request.
    async with ClientSession() as session:
        for i in range(r):
            url = 'https://www.officialcharts.com/charts/singles-chart/' + weeks[i] + '/'
            task = asyncio.ensure_future(bound_fetch(sem, url, session))
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        result.append(responses)

number = 5
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(number))
loop.run_until_complete(future)
print('Done')
print(result[0][0] == None)

For some reason, this doesn't return any responses.

PS: I am not from a CS background and just program for fun. I have no clue what's going on inside the asyncio code.

1 Answer


Try to use the latest version of Python (3.7+); the example below relies on asyncio.run, which was added in Python 3.7.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from aiohttp import ClientSession, client_exceptions
from asyncio import Semaphore, ensure_future, gather, run
from json import dumps, loads

limit = 10
http_ok = [200]


async def scrape(url_list):
    tasks = list()
    sem = Semaphore(limit)

    async with ClientSession() as session:
        for url in url_list:
            task = ensure_future(scrape_bounded(url, sem, session))
            tasks.append(task)

        result = await gather(*tasks)

    return result


async def scrape_bounded(url, sem, session):
    async with sem:
        return await scrape_one(url, session)


async def scrape_one(url, session):
    try:
        async with session.get(url) as response:
            content = await response.read()
    except client_exceptions.ClientConnectorError:
        print('Scraping %s failed due to the connection problem' % url)
        return False

    if response.status not in http_ok:
        print('Scraping %s failed due to the return code %s' % (url, response.status))
        return False

    content = loads(content.decode('UTF-8'))

    return content


if __name__ == '__main__':
    urls = ['http://demin.co/echo1/', 'http://demin.co/echo2/']
    res = run(scrape(urls))

    print(dumps(res, indent=4))

This is a template taken from a real project, and it works as expected.
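If you want to point it at the singles-chart pages from your question, note that those URLs return HTML rather than JSON, so the json.loads step has to go. Below is a sketch of that adaptation, not a drop-in solution: the week-based URL list mirrors your question, and returning the decoded HTML (for later parsing with BeautifulSoup) is my assumption about what you need.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Sketch: the template above, adapted to fetch HTML chart pages instead of JSON.
# The weekly URL construction mirrors the question's `weeks` list; the rest is
# an assumption about how it could be wired up.
from datetime import date, timedelta

from aiohttp import ClientSession, client_exceptions
from asyncio import Semaphore, ensure_future, gather, run

limit = 10
http_ok = [200]


async def scrape(url_list):
    tasks = list()
    sem = Semaphore(limit)
    async with ClientSession() as session:
        for url in url_list:
            tasks.append(ensure_future(scrape_bounded(url, sem, session)))
        return await gather(*tasks)


async def scrape_bounded(url, sem, session):
    async with sem:
        return await scrape_one(url, session)


async def scrape_one(url, session):
    try:
        async with session.get(url) as response:
            content = await response.read()
    except client_exceptions.ClientConnectorError:
        print('Scraping %s failed due to a connection problem' % url)
        return False
    if response.status not in http_ok:
        print('Scraping %s failed with return code %s' % (url, response.status))
        return False
    # Return the raw HTML; parse it later with BeautifulSoup.
    return content.decode('UTF-8')


if __name__ == '__main__':
    # Rebuild a handful of the weekly chart dates from the question.
    weeks, d = [], date(1970, 1, 1)
    d += timedelta(days=6 - d.weekday())
    for _ in range(5):
        weeks.append(d.strftime('%Y%m%d'))
        d += timedelta(days=7)

    urls = ['https://www.officialcharts.com/charts/singles-chart/%s/' % w
            for w in weeks]
    pages = run(scrape(urls))
    print([len(page) if page else page for page in pages])

The semaphore (limit = 10) keeps the number of in-flight requests bounded, which is the same idea that prevents the failures you saw when firing more than 1000 requests at once.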

You can find this source code here
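As for why your second snippet prints True for result[0][0] == None: bound_fetch awaits fetch but never returns its value, so asyncio.gather collects None for every task. A one-line sketch of the fix, with everything else in your code unchanged:

async def bound_fetch(sem, url, session):
    # Getter function with semaphore. Returning the awaited value is what
    # lets asyncio.gather() hand the response bodies back to run().
    async with sem:
        return await fetch(url, session)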


9 Comments

I get an error "asyncio.run() cannot be called from a running event loop" in Jupyter Lab, but it works in the Python shell (see the sketch after these comments). Thanks!
You should check your Python version carefully; it should be strictly 3.7.2+, because asyncio changed in recent releases.
@Dimitrii, is it possible to download multiple files using asyncio and wget? I found one answer here, but I would like to modify your answer to use wget.
Be careful: non-async tools can cause exotic corner cases, so it's better to use native async facilities, like aiohttp's get, especially if you use frameworks like Flask or Django. Still, if it works, you can update the example with an alternative solution. Why do you need wget, anyway?
I am a total noob. I posted a question here
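Regarding the "asyncio.run() cannot be called from a running event loop" error mentioned in the first comment: a notebook cell already has an event loop running, while a plain Python shell does not. A small self-contained sketch of the two usual workarounds (the demo coroutine is just a placeholder, not part of the answer's code):

import asyncio

async def demo():
    # Placeholder coroutine standing in for scrape(urls) from the answer.
    await asyncio.sleep(0.1)
    return 'done'

# In a script or a plain Python shell this works:
#     print(asyncio.run(demo()))

# In a Jupyter/IPython cell, asyncio.run() fails because a loop is already
# running. Either use the notebook's top-level await support:
#     result = await demo()
# ...or patch asyncio to allow nested loops, as the question already does:
#     import nest_asyncio
#     nest_asyncio.apply()
#     result = asyncio.run(demo())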
