A simple way to collect output from parallel `md5sum` subprocesses is to use a thread pool and write the results to a file from the main thread:
```python
from multiprocessing.dummy import Pool  # use threads
from subprocess import check_output

def md5sum(filename):
    try:
        return check_output(["md5sum", filename]), None
    except Exception as e:
        return None, e

if __name__ == "__main__":
    p = Pool(number_of_processes)  # specify number of concurrent processes
    with open("md5sums.txt", "wb") as logfile:
        for output, error in p.imap(md5sum, filenames):  # provide filenames
            if error is None:
                logfile.write(output)
```
- the output from `md5sum` is small, so you can store it in memory
- `imap` preserves order
- `number_of_processes` may differ from the number of files or CPU cores (larger values don't mean faster: it depends on the relative performance of I/O (disks) and CPU)
You can try passing several files at once to each `md5sum` subprocess, to amortize the process startup cost, as sketched below.
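A minimal sketch of that batching idea, reusing the thread-pool setup above; `md5sum_batch` and `chunks` are helpers introduced here, `filenames` is a hypothetical input list, and the batch size of 20 is an arbitrary choice. It relies on the fact that `md5sum` accepts several file arguments and prints one `digest  filename` line per file:

```python
from multiprocessing.dummy import Pool  # use threads
from subprocess import check_output

def md5sum_batch(batch):
    # md5sum accepts several file arguments and prints one line per file
    try:
        return check_output(["md5sum"] + list(batch)), None
    except Exception as e:
        return None, e

def chunks(seq, size):
    # yield successive fixed-size slices of seq
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

if __name__ == "__main__":
    filenames = ["a.bin", "b.bin", "c.bin"]  # hypothetical input list
    with open("md5sums.txt", "wb") as logfile:
        for output, error in Pool().imap(md5sum_batch, chunks(filenames, 20)):
            if error is None:
                logfile.write(output)
```

One trade-off of this sketch: if any file in a batch makes `md5sum` exit non-zero, the whole batch's output is discarded, so smaller batches limit how much is lost.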
You don't need an external subprocess in this case; you can calculate the md5 in Python:
```python
import hashlib
from functools import partial

def md5sum(filename, chunksize=2**15, bufsize=-1):
    m = hashlib.md5()
    with open(filename, 'rb', bufsize) as f:
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m.hexdigest()
```
To use multiple processes instead of threads (allowing the pure-Python `md5sum()` to run in parallel on multiple CPUs), just drop `.dummy` from the import in the code above; a combined sketch follows.
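For illustration, a sketch of how the pieces might combine, again with a hypothetical `filenames` list. Note that the pure-Python `md5sum()` returns only the hex digest, so the filename is written next to it explicitly; since `imap` preserves order, zipping the results with `filenames` pairs them correctly:

```python
import hashlib
from functools import partial
from multiprocessing import Pool  # processes, not threads

def md5sum(filename, chunksize=2**15, bufsize=-1):
    m = hashlib.md5()
    with open(filename, 'rb', bufsize) as f:
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m.hexdigest()

if __name__ == "__main__":
    filenames = ["a.bin", "b.bin"]  # hypothetical input list
    with Pool() as pool, open("md5sums.txt", "w") as logfile:
        for filename, digest in zip(filenames, pool.imap(md5sum, filenames)):
            # mimic md5sum's "digest  filename" output format
            logfile.write("%s  %s\n" % (digest, filename))
```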