I am trying to parallelize the processing of some JSON files using Python's multiprocessing module. The amount of time required to process a fixed number of files within a subprocess seems to depend on the overall number of files. My expectation was that this should be constant.
In the example below, I compare one process reading 100 files vs. ten processes reading 1000 files (100 each) vs. 100 processes reading 10000 files (100 each). In all three cases, each process reads exactly 100 files, so presumably each individual subprocess should take the same amount of time. (The total time for the parent process to run depends on the total number of processes and the total number of cores, but that's not the time I'm referring to.) In fact, the per-process time is increasing.
Simplified Example
To avoid complications from variability in file size, and to rule out effects of multiple processes reading the same file, I simply made 10,000 copies of a single 144KB JSON document:
```shell
for i in `seq 1 10000`; do cp sample.json data/${i}.json; done
```
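(The same test data can also be generated from Python. This is just a sketch: it writes a tiny placeholder `sample.json` so it runs standalone; substitute the real 144KB document.)

```python
import json
import os
import shutil

# Write a small placeholder sample document (stand-in for the 144KB file),
# then make 10,000 identical copies of it under data/.
os.makedirs('data', exist_ok=True)
with open('sample.json', 'w') as f:
    json.dump({'key': 'value'}, f)

for i in range(1, 10001):
    shutil.copyfile('sample.json', f'data/{i}.json')
```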
The Python script looks as follows:
```python
import glob
import json
import sys
import time
from multiprocessing import Pool

# process all files in chunk, printing
# chunk ID, # files in chunk, time to process all files
def process_chunk(chunk):
    chunk_id, files = chunk
    t0 = time.time()
    for f in files:
        process_single_file(f)
    t1 = time.time()
    print(chunk_id, len(files), t1 - t0)

# this does nothing but read the file into memory and parse the JSON,
# but that is enough to illustrate the point
def process_single_file(filename):
    with open(filename, 'r') as f:
        s = f.read()
    data = json.loads(s)

if __name__ == '__main__':
    files = glob.glob('data/*.json')
    files_per_chunk = 100
    num_chunks = int(sys.argv[1])
    chunks = [files[(i * files_per_chunk):((i + 1) * files_per_chunk)]
              for i in range(num_chunks)]

    with Pool(processes = num_chunks) as pool:
        pool.map(process_chunk, enumerate(chunks))
```

For example, running `./my-script.py 1` should process one chunk of 100 files in one process. Running `./my-script.py 10` should process ten chunks of 100 files each in ten processes. Regardless of the number of cores, I believe the amount of time to process each chunk should be the same in both cases.
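(One diagnostic I can think of, not part of the script above: record CPU time alongside wall-clock time in each chunk. If wall time grows with the number of processes but `time.process_time()` stays flat, the workers are waiting on the scheduler or on I/O rather than doing more work.)

```python
import json
import time

def process_chunk_timed(chunk):
    # Like process_chunk, but reports CPU time alongside wall-clock time.
    chunk_id, files = chunk
    wall0 = time.time()
    cpu0 = time.process_time()  # CPU time consumed by this process only
    for filename in files:
        with open(filename, 'r') as f:
            json.loads(f.read())
    wall = time.time() - wall0
    cpu = time.process_time() - cpu0
    print(chunk_id, len(files), f'wall={wall:.4f}s cpu={cpu:.4f}s')
    return wall, cpu
```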
Results
```
> ./my-script.py 1
0 100 0.07086801528930664

> ./my-script.py 10
0 100 0.16899609565734863
1 100 0.19768595695495605
4 100 0.17228388786315918
2 100 0.1956641674041748
3 100 0.17895913124084473
5 100 0.16188788414001465
6 100 0.16206908226013184
7 100 0.15983009338378906
8 100 0.15669989585876465
9 100 0.15811610221862793

> ./my-script.py 100
3 100 3.8171892166137695
4 100 3.8234598636627197
1 100 3.8310959339141846
9 100 3.8683879375457764
7 100 3.871474027633667
6 100 3.878866195678711
...
```

I have tried a number of variations of this, both on a Mac with 6 cores and on an EC2 instance running Linux with 24 dual cores. Can someone help me understand why the per-process time scales with the number of processes?