
I want to download many files from Dukascopy. A typical URL looks like this:

url = 'http://datafeed.dukascopy.com/datafeed/AUDUSD/2014/01/02/00h_ticks.bi5' 

I tried the answer here, but most of the files it downloads are of size 0.

But when I simply looped using wget (see below), I got complete files.

import wget
from urllib.error import HTTPError

pair = 'AUDUSD'
for year in range(2014, 2015):
    for month in range(1, 13):
        for day in range(1, 32):
            for hour in range(24):
                try:
                    url = ('http://datafeed.dukascopy.com/datafeed/' + pair + '/'
                           + str(year) + '/' + str(month - 1).zfill(2) + '/'
                           + str(day).zfill(2) + '/' + str(hour).zfill(2) + 'h_ticks.bi5')
                    filename = (pair + '-' + str(year) + '-' + str(month - 1).zfill(2) + '-'
                                + str(day).zfill(2) + '-' + str(hour).zfill(2) + 'h_ticks.bi5')
                    x = wget.download(url, filename)
                    # print(url)
                except HTTPError as err:
                    if err.code == 404:
                        print((year, month, day, hour))
                    else:
                        raise

I have used the following code before for scraping websites, but not for downloading files.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from aiohttp import ClientSession, client_exceptions
from asyncio import Semaphore, ensure_future, gather, run
from json import dumps, loads

limit = 10
http_ok = [200]


async def scrape(url_list):
    tasks = list()
    sem = Semaphore(limit)

    async with ClientSession() as session:
        for url in url_list:
            task = ensure_future(scrape_bounded(url, sem, session))
            tasks.append(task)
        result = await gather(*tasks)

    return result


async def scrape_bounded(url, sem, session):
    async with sem:
        return await scrape_one(url, session)


async def scrape_one(url, session):
    try:
        async with session.get(url) as response:
            content = await response.read()
    except client_exceptions.ClientConnectorError:
        print('Scraping %s failed due to the connection problem' % url)
        return False

    if response.status not in http_ok:
        print('Scraping %s failed due to the return code %s' % (url, response.status))
        return False

    content = loads(content.decode('UTF-8'))
    return content


if __name__ == '__main__':
    urls = ['http://demin.co/echo1/', 'http://demin.co/echo2/']
    res = run(scrape(urls))
    print(dumps(res, indent=4))

There is an answer here that downloads multiple files using multiprocessing, but I think asyncio could be faster.

When files of size 0 are returned, it could be the server limiting the number of requests, but I would still like to explore whether it is possible to download multiple files using wget and asyncio.
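
Since wget.download is a blocking call, one way to combine wget with asyncio is to push each download into a worker thread and cap the concurrency with a semaphore. The following is only a rough sketch along those lines, assuming Python 3.9+ for asyncio.to_thread and reusing the URL pattern from above; the concurrency limit of 5 is a guess, not something confirmed against the server.

import asyncio
import wget
from urllib.error import HTTPError

limit = 5  # assumed cap to stay below whatever rate limit the server enforces


async def fetch(url, filename, sem):
    async with sem:
        try:
            # wget.download blocks, so run it in a worker thread
            return await asyncio.to_thread(wget.download, url, filename)
        except HTTPError as err:
            if err.code == 404:
                print('missing:', url)
                return None
            raise


async def main():
    sem = asyncio.Semaphore(limit)
    pair = 'AUDUSD'
    tasks = []
    for hour in range(24):
        url = f'http://datafeed.dukascopy.com/datafeed/{pair}/2014/01/02/{str(hour).zfill(2)}h_ticks.bi5'
        filename = f'{pair}-2014-01-02-{str(hour).zfill(2)}h_ticks.bi5'
        tasks.append(fetch(url, filename, sem))
    return await asyncio.gather(*tasks)


if __name__ == '__main__':
    asyncio.run(main())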

1 Answer


Here is an example. The decoding/encoding, as well as the writing operations, should be adjusted depending on the target data type.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from aiofile import AIOFile
from aiohttp import ClientSession
from asyncio import ensure_future, gather, run, Semaphore
from calendar import monthlen
from lzma import open as lzma_open
from struct import calcsize, unpack
from io import BytesIO
from json import dumps

http_ok = [200]
limit = 5
base_url = 'http://datafeed.dukascopy.com/datafeed/{}/{}/{}/{}/{}h_ticks.bi5'
fmt = '>3i2f'
chunk_size = calcsize(fmt)


async def download():
    tasks = list()
    sem = Semaphore(limit)

    async with ClientSession() as session:
        for pair in ['AUDUSD']:
            for year in [2014, 2015]:
                # cover all 12 months, every day of the month and all 24 hours
                for month in range(1, 13):
                    for day in range(1, monthlen(year, month) + 1):
                        for hour in range(0, 24):
                            tasks.append(ensure_future(download_one(pair=pair,
                                                                    year=str(year).zfill(2),
                                                                    month=str(month).zfill(2),
                                                                    day=str(day).zfill(2),
                                                                    hour=str(hour).zfill(2),
                                                                    session=session,
                                                                    sem=sem)))
        return await gather(*tasks)


async def download_one(pair, year, month, day, hour, session, sem):
    url = base_url.format(pair, year, month, day, hour)
    data = list()

    async with sem:
        async with session.get(url) as response:
            content = await response.read()

        if response.status not in http_ok:
            print(f'Scraping {url} failed due to the return code {response.status}')
            return

        if content == b'':
            print(f'Scraping {url} failed due to the empty content')
            return

        # the payload is LZMA-compressed fixed-size records; unpack them one by one
        with lzma_open(BytesIO(content)) as f:
            while True:
                chunk = f.read(chunk_size)
                if chunk:
                    data.append(unpack(fmt, chunk))
                else:
                    break

        async with AIOFile(f'{pair}-{year}-{month}-{day}-{hour}.bi5', 'w') as fl:
            await fl.write(dumps(data, indent=4))

    return


if __name__ == '__main__':
    run(download())
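
As noted above, the decoding and writing steps depend on the target format. If, for example, CSV output is preferred over the JSON dump, the unpacked tuples could be written with the standard csv module instead; the column names below are only my guess at what the '>3i2f' records contain, not a documented schema.

import csv


def write_csv(data, path):
    # data is the list of tuples produced by unpack(fmt, chunk) above
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['ms_offset', 'ask', 'bid', 'ask_volume', 'bid_volume'])
        writer.writerows(data)

# e.g. write_csv(data, f'{pair}-{year}-{month}-{day}-{hour}.csv')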

The source code is available here.


5 Comments

I am getting "Dumping failed due to the incorrect content" for all URLs.
I know; I don't know how to work with bi5, what it is, or what the headers should be.
Do you think this helps?
Updated. Format the output as you want; I just dumped it with JSON.
Now I am getting "failed due to the return code 503". There is nothing wrong with the code; it's the website limiting parallel requests, which ironically defeats the purpose of using asyncio. But I think this is useful in other cases. Thanks a lot for the effort!
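
If the 503 responses are indeed rate limiting, one option (untested here) would be to lower the semaphore limit and retry each request a few times with a growing delay. A rough sketch of such a retry wrapper around session.get, with arbitrary retry counts and delays:

from asyncio import sleep


async def get_with_retry(session, url, retries=3, delay=5):
    # Retry on 503 with a simple linear backoff; the numbers are guesses.
    for attempt in range(retries):
        async with session.get(url) as response:
            if response.status != 503:
                return response.status, await response.read()
        await sleep(delay * (attempt + 1))
    return 503, b''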
