
I am screenshotting several thousand web pages with pyppeteer. I discovered by accident that running the same script in 2 open terminals doubles the output I get. I tested this with up to 6 terminals running the script, and I got up to 6 times the throughput.

I am considering using loop.run_in_executor to run the script in multiple processes or threads from a main program.

Is this the right call, or am I hitting some I/O or CPU limit in my script?

Here is how I'm thinking of doing it. I don't know if this is the right thing to do.

import asyncio
import concurrent.futures

async def blocking_io():
    # File operations (such as logging) can block the
    # event loop: run them in a thread pool.
    with open('/dev/urandom', 'rb') as f:
        return f.read(100)

async def cpu_bound():
    # CPU-bound operations will block the event loop:
    # in general it is preferable to run them in a
    # process pool.
    return sum(i * i for i in range(10 ** 7))

def wrap_blocking_io():
    return asyncio.run(blocking_io())

def wrap_cpu_bound():
    return asyncio.run(cpu_bound())

async def main():
    loop = asyncio.get_running_loop()

    # Options:

    # 1. Run in the default loop's executor:
    result = await loop.run_in_executor(
        None, wrap_blocking_io)
    print('default thread pool', result)

    # 2. Run in a custom thread pool:
    with concurrent.futures.ThreadPoolExecutor(max_workers=6) as pool:
        result = await loop.run_in_executor(
            pool, wrap_blocking_io)
        print('custom thread pool', result)

    # 3. Run in a custom process pool:
    with concurrent.futures.ProcessPoolExecutor(max_workers=6) as pool:
        result = await loop.run_in_executor(
            pool, wrap_cpu_bound)
        print('custom process pool', result)

asyncio.run(main())
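Applied to the screenshot job, this pattern would mean splitting the URL list into chunks and giving each worker its own event loop via asyncio.run. Here is a minimal runnable sketch of that idea; take_screenshot, worker, and the chunking scheme are all illustrative names of mine, and the coroutine is simulated with a sleep rather than a real pyppeteer call:

```python
import asyncio
import concurrent.futures

async def take_screenshot(url):
    # Placeholder for the real pyppeteer work (opening a
    # page and saving a screenshot); simulated with a sleep.
    await asyncio.sleep(0.01)
    return url + '.png'

async def screenshot_chunk(urls):
    # One worker screenshots its chunk sequentially.
    return [await take_screenshot(u) for u in urls]

def worker(urls):
    # asyncio.run gives each worker its own event loop.
    return asyncio.run(screenshot_chunk(urls))

async def main(urls, n_workers=3):
    loop = asyncio.get_running_loop()
    chunks = [urls[i::n_workers] for i in range(n_workers)]
    # A thread pool suffices for I/O-bound screenshot work;
    # swap in ProcessPoolExecutor if the work is CPU-bound.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [loop.run_in_executor(pool, worker, c) for c in chunks]
        results = [await f for f in futures]
    # Flatten the per-worker result lists.
    return [name for chunk in results for name in chunk]

if __name__ == '__main__':
    print(sorted(asyncio.run(main(['page%d' % i for i in range(6)]))))
```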
  • There's nothing particularly bad about using loop.run_in_executor within your async code. Commented Jun 7, 2019 at 9:12
  • Is it okay to use it to run an asynchronous function? I've shared an example of how I'd do this. I don't think there's anything blocking in my code. Commented Jun 7, 2019 at 9:14

1 Answer 1


I tested this by opening up to 6 terminals and running the script and I was able to get up to 6 times the performance.

Since pyppeteer is already asynchronous, I presume you simply aren't running multiple browsers in parallel, and that's why your output increases when you run multiple processes.

To run several coroutines concurrently ("in parallel") you usually use something like asyncio.gather. Does your code use it? If not, check this example: this is how you should run multiple jobs:

responses = await asyncio.gather(*tasks) 
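As an illustration, here is a self-contained sketch of that pattern; the take_screenshot coroutine is a placeholder standing in for the real pyppeteer call, and the asyncio.Semaphore (my addition, not from the question) caps how many pages run at once:

```python
import asyncio

async def take_screenshot(url, sem):
    # Placeholder for the real pyppeteer call; the semaphore
    # limits how many screenshots run concurrently.
    async with sem:
        await asyncio.sleep(0.01)
        return url + '.png'

async def main(urls, max_concurrent=6):
    sem = asyncio.Semaphore(max_concurrent)
    tasks = [take_screenshot(u, sem) for u in urls]
    # gather runs all coroutines concurrently on one event
    # loop and returns results in the original input order.
    return await asyncio.gather(*tasks)

print(asyncio.run(main(['a', 'b', 'c'])))
# prints ['a.png', 'b.png', 'c.png']
```

The point is that a single event loop can drive many screenshot coroutines at once; multiple terminals only help when the script awaits each page one at a time.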

If you are already using asyncio.gather, consider providing a Minimal, Reproducible Example to make it easier to understand what is happening.



