
I want to write code that reads several pandas DataFrames asynchronously, for example from CSV files (or from a database).

I wrote the following code, assuming it would import the two data frames faster; however, it seems to be slower:

import timeit
import asyncio
import pandas as pd

train_to_save = pd.DataFrame(data={'feature1': [1, 2, 3], 'period': [1, 1, 1]})
test_to_save = pd.DataFrame(data={'feature1': [1, 4, 12], 'period': [2, 2, 2]})
train_to_save.to_csv('train.csv')
test_to_save.to_csv('test.csv')

async def run_async_train():
    return pd.read_csv('train.csv')

async def run_async_test():
    return pd.read_csv('test.csv')

async def run_train_test_async():
    df = await asyncio.gather(run_async_train(), run_async_test())
    return df

# Time the "async" version.
start_async = timeit.default_timer()
async_train, async_test = asyncio.run(run_train_test_async())
finish_async = timeit.default_timer()
time_to_run_async = finish_async - start_async

# Time the plain sequential version.
start = timeit.default_timer()
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
finish = timeit.default_timer()
time_to_run_without_async = finish - start

print(time_to_run_async < time_to_run_without_async)

Why does the non-async version read the two data frames faster?

Just to make it clear: I'm actually going to read the data from BigQuery, so I'm really interested in speeding up both requests (train & test) using the code above.

Thanks in advance!

3 Comments
  • When it comes to reading (large) files, the bottleneck is usually seeking/reading from the disk, not processing power. So reading two files at the same time might not speed things up, since the disk has to physically jump back and forth between the two different locations (files). Commented Sep 10, 2019 at 13:22
  • Will it be faster when reading from a database? Commented Sep 10, 2019 at 13:25
  • It depends. Databases are usually designed for concurrent requests, so likely yes, but take it with a grain of salt. Commented Sep 10, 2019 at 13:30

1 Answer


pd.read_csv isn't an async method, so I don't believe you're actually getting any parallelism out of this. You'd need to use an async file library like aiofiles to read the files into buffers asynchronously, then hand those buffers to pd.read_csv().
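For example, a minimal sketch of that approach (my own illustration, assuming aiofiles is installed and reusing the question's file names):

import io
import asyncio
import aiofiles
import pandas as pd

async def read_csv_async(path):
    # aiofiles hands the blocking read off to a thread pool,
    # so awaiting it doesn't block the event loop.
    async with aiofiles.open(path, mode='r') as f:
        contents = await f.read()
    # Parsing itself still happens synchronously in pd.read_csv.
    return pd.read_csv(io.StringIO(contents))

async def read_train_test():
    # Schedule both reads concurrently and wait for both to finish.
    return await asyncio.gather(read_csv_async('train.csv'),
                                read_csv_async('test.csv'))

train, test = asyncio.run(read_train_test())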

Note that most filesystems aren't really async, so aiofiles is functionally a thread pool. However it will still likely be faster than reading serially.
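Since it comes down to a thread pool either way, another sketch (again my own illustration, requires Python 3.9+) is to hand the blocking pd.read_csv calls to worker threads directly with asyncio.to_thread:

import asyncio
import pandas as pd

async def read_train_test():
    # asyncio.to_thread runs each blocking call in a worker thread,
    # so both files can be read at the same time.
    return await asyncio.gather(
        asyncio.to_thread(pd.read_csv, 'train.csv'),
        asyncio.to_thread(pd.read_csv, 'test.csv'),
    )

train, test = asyncio.run(read_train_test())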


Here's an example I had with aiohttp, fetching CSVs from URLs:

import io
import asyncio
import aiohttp
import pandas as pd

async def get_csv_async(client, url):
    # Send a request.
    async with client.get(url) as response:
        # Read entire response text and convert to file-like using StringIO().
        with io.StringIO(await response.text()) as text_io:
            return pd.read_csv(text_io)

async def get_all_csvs_async(urls):
    async with aiohttp.ClientSession() as client:
        # First create all futures at once.
        futures = [get_csv_async(client, url) for url in urls]
        # Then wait for all the futures to complete.
        return await asyncio.gather(*futures)

urls = [
    # Some random CSV urls from the internet
    'https://people.sc.fsu.edu/~jburkardt/data/csv/hw_25000.csv',
    'https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv',
    'https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv',
]

if __name__ == '__main__':
    # Run the event loop.
    # In Python 3.7+ you can just do `csvs = asyncio.run(get_all_csvs_async(urls))`.
    csvs = asyncio.get_event_loop().run_until_complete(get_all_csvs_async(urls))
    for csv in csvs:
        print(csv)

1 Comment

I had to add import nest_asyncio and then nest_asyncio.apply(). See stackoverflow.com/questions/46827007/…. No idea why, but it worked.
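That is typically needed when the snippet runs inside a Jupyter/IPython notebook, where an event loop is already running; nest_asyncio patches the loop so it can be re-entered. A sketch of the workaround, assuming a notebook context:

import asyncio
import nest_asyncio

# Patch the already-running notebook event loop to allow nested use.
nest_asyncio.apply()

csvs = asyncio.get_event_loop().run_until_complete(get_all_csvs_async(urls))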
