I am trying to parallelize the processing of some JSON files using Python's multiprocessing module. The amount of time required to process a fixed number of files within a subprocess seems to depend on the overall number of files. My expectation was that this should be constant.
In the example below, I compare one process reading 100 files vs. ten processes reading 1000 files (100 each) vs. 100 processes reading 10000 files (100 each). In all three cases, each process reads exactly 100 files, so presumably each individual subprocess should take the same amount of time. (The total time for the parent process to run depends on the total number of processes and the total number of cores, but that's not the time I'm referring to.) In fact, the per-process time is increasing.
Simplified Example
To avoid complications from variability in file size, and to rule out effects of multiple processes reading the same file, I simply made 10,000 copies of a single 144KB JSON document:
```shell
for i in `seq 1 10000`; do cp sample.json data/${i}.json; done
```
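(The same test data can also be generated from Python. This is just a sketch: it writes a tiny placeholder `sample.json` so it runs standalone; substitute the real 144KB document.)

```python
import json
import os
import shutil

# Write a small placeholder sample document (stand-in for the 144KB file),
# then make 10,000 identical copies of it under data/.
os.makedirs('data', exist_ok=True)
with open('sample.json', 'w') as f:
    json.dump({'key': 'value'}, f)

for i in range(1, 10001):
    shutil.copyfile('sample.json', f'data/{i}.json')
```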
The Python script looks as follows:
```python
import glob
import json
import sys
import time
from multiprocessing import Pool

# process all files in chunk, printing
# chunk ID, # files in chunk, time to process all files
def process_chunk(chunk):
    chunk_id, files = chunk
    t0 = time.time()
    for f in files:
        process_single_file(f)
    t1 = time.time()
    print(chunk_id, len(files), t1 - t0)

# this does nothing but read the file into memory and parse the JSON,
# but that is enough to illustrate the point
def process_single_file(filename):
    with open(filename, 'r') as f:
        s = f.read()
    data = json.loads(s)

if __name__ == '__main__':
    files = glob.glob('data/*.json')
    files_per_chunk = 100
    num_chunks = int(sys.argv[1])
    chunks = [files[(i * files_per_chunk):((i + 1) * files_per_chunk)]
              for i in range(num_chunks)]

    with Pool(processes = num_chunks) as pool:
        pool.map(process_chunk, enumerate(chunks))
```

For example, running `./my-script.py 1` should process one chunk of 100 files in one process. Running `./my-script.py 10` should process ten chunks of 100 files each in ten processes. Regardless of the number of cores, I believe the amount of time to process each chunk should be the same in both cases.
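(One diagnostic I can think of, not part of the script above: record CPU time alongside wall-clock time in each chunk. If wall time grows with the number of processes but `time.process_time()` stays flat, the workers are waiting on the scheduler or on I/O rather than doing more work.)

```python
import json
import time

def process_chunk_timed(chunk):
    # Like process_chunk, but reports CPU time alongside wall-clock time.
    chunk_id, files = chunk
    wall0 = time.time()
    cpu0 = time.process_time()  # CPU time consumed by this process only
    for filename in files:
        with open(filename, 'r') as f:
            json.loads(f.read())
    wall = time.time() - wall0
    cpu = time.process_time() - cpu0
    print(chunk_id, len(files), f'wall={wall:.4f}s cpu={cpu:.4f}s')
    return wall, cpu
```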
Results
```
> ./my-script.py 1
0 100 0.07086801528930664

> ./my-script.py 10
0 100 0.16899609565734863
1 100 0.19768595695495605
4 100 0.17228388786315918
2 100 0.1956641674041748
3 100 0.17895913124084473
5 100 0.16188788414001465
6 100 0.16206908226013184
7 100 0.15983009338378906
8 100 0.15669989585876465
9 100 0.15811610221862793

> ./my-script.py 100
3 100 3.8171892166137695
4 100 3.8234598636627197
1 100 3.8310959339141846
9 100 3.8683879375457764
7 100 3.871474027633667
6 100 3.878866195678711
...
```

I have tried a number of variations of this, both on a Mac with 6 cores and on an EC2 instance running Linux with 24 dual cores. Can someone help me understand why the per-process time scales with the number of processes?