I've been trying to convert a large file (27 billion lines) to JSON. Google Compute recommends taking advantage of multithreading to improve write times. I've converted my code from this:
    import math
    import json
    import progressbar

    f = open('output.txt', 'r')
    r = open('json.txt', 'w')

    num_permutations = (math.factorial(124) / math.factorial(124 - 5))

    main_bar = progressbar.ProgressBar(
        maxval=num_permutations,
        widgets=[progressbar.Bar('=', '[', ']'), ' ',
                 progressbar.Percentage(), progressbar.AdaptiveETA()])
    main_bar.start()

    m = 0
    for lines in f:
        x = lines[:-1].split(' ')
        x = json.dumps(x)
        x += '\n'
        r.write(x)
        m += 1
        main_bar.update(m)

to this:
    import math
    import json
    import threading
    import progressbar
    from Queue import Queue

    q = Queue(maxsize=5)

    def worker():
        while True:
            task = q.get()
            r.write(task)
            q.task_done()

    for i in range(4):
        t = threading.Thread(target=worker)
        t.daemon = True
        t.start()

    f = open('output.txt', 'r')
    r = open('teams.txt', 'w')

    num_permutations = (math.factorial(124) / math.factorial(124 - 5))

    main_bar = progressbar.ProgressBar(
        maxval=num_permutations,
        widgets=[progressbar.Bar('=', '[', ']'), ' ',
                 progressbar.Percentage(), progressbar.AdaptiveETA()])
    main_bar.start()

    m = 0
    for lines in f:
        x = lines[:-1].split(' ')
        x = json.dumps(x)
        x += '\n'
        q.put(x)
        m += 1
        main_bar.update(m)

I've copied the Queue code pretty much straight from the module manual.
Before, the whole script was estimated to take 2 days. Now the ETA says 20 days! I'm not quite sure why; could anyone explain this to me?
EDIT: This could be considered a Python Global Interpreter Lock (GIL) problem, but I don't think it is: the work is not computationally intensive, and this looks like an IO bottleneck. From the threading docs:
If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
My understanding of this is limited, but I believe this is the latter, i.e. an IO-bound task. That was my original reason for trying multithreading in the first place: the computation was being blocked by IO calls that could be moved to a separate thread, allowing the computation to continue.
FURTHER EDIT: Perhaps the real bottleneck is IO on the INPUT, and that is what is slowing it down. Any ideas on how I could effectively move the for loop to a separate thread? Thanks!
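One thing worth trying, regardless of where the bottleneck turns out to be: queue lines in batches rather than one at a time, so the Queue's per-item locking overhead is paid once per batch instead of once per line, and use a single writer thread that owns the output file. This is only a sketch under those assumptions, not the poster's code; it uses Python 3's `queue` module, and `convert`, its file paths, and `batch_size` are hypothetical names:

```python
import json
import queue
import threading


def convert(in_path, out_path, batch_size=10000):
    """Convert space-separated lines to JSON arrays, one writer thread.

    Lines are queued in batches of `batch_size` so Queue synchronization
    overhead is amortized across many lines.
    """
    q = queue.Queue(maxsize=8)  # bounded, so the reader can't run far ahead
    SENTINEL = None             # tells the writer there is no more work

    def writer():
        # Sole owner of the output file: no contention between threads.
        with open(out_path, 'w') as out:
            while True:
                batch = q.get()
                if batch is SENTINEL:
                    break
                out.writelines(batch)

    t = threading.Thread(target=writer)
    t.start()

    batch = []
    with open(in_path) as src:
        for line in src:
            batch.append(json.dumps(line.rstrip('\n').split(' ')) + '\n')
            if len(batch) >= batch_size:
                q.put(batch)
                batch = []
    if batch:
        q.put(batch)      # flush the final partial batch
    q.put(SENTINEL)
    t.join()
```

Because there is a single queue and a single writer, output order is preserved; the batching just changes how often the threads synchronize.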
factorial(a)/factorial(a-b) = reduce(operator.mul, range(a, a-b, -1))
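The identity above avoids materializing two enormous factorials. A quick sanity check of it, using only the standard library (the helper name `falling_factorial` is mine, not from the post):

```python
import math
import operator
from functools import reduce


def falling_factorial(a, b):
    # a * (a-1) * ... * (a-b+1), i.e. factorial(a) // factorial(a-b)
    return reduce(operator.mul, range(a, a - b, -1))


# Matches the factorial form used for num_permutations in the question.
assert falling_factorial(124, 5) == math.factorial(124) // math.factorial(124 - 5)
```

Only 5 multiplications of small integers, versus two 200+ digit factorials and a division.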