
I've been trying to convert a large file with many lines (27 billion) to JSON. Google Compute recommends that I take advantage of multithreading to improve write times. I've converted my code from this:

import json
import math
import progressbar

f = open('output.txt', 'r')
r = open('json.txt', 'w')

num_permutations = math.factorial(124) / math.factorial(124 - 5)

main_bar = progressbar.ProgressBar(maxval=num_permutations,
    widgets=[progressbar.Bar('=', '[', ']'), ' ',
             progressbar.Percentage(), progressbar.AdaptiveETA()])
main_bar.start()

m = 0
for lines in f:
    x = lines[:-1].split(' ')
    x = json.dumps(x)
    x += '\n'
    r.write(x)
    m += 1
    main_bar.update(m)

to this:

import json
import math
import progressbar
import threading
from Queue import Queue

q = Queue(maxsize=5)

def worker():
    while True:
        task = q.get()
        r.write(task)
        q.task_done()

for i in range(4):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

f = open('output.txt', 'r')
r = open('teams.txt', 'w')

num_permutations = math.factorial(124) / math.factorial(124 - 5)

main_bar = progressbar.ProgressBar(maxval=num_permutations,
    widgets=[progressbar.Bar('=', '[', ']'), ' ',
             progressbar.Percentage(), progressbar.AdaptiveETA()])
main_bar.start()

m = 0
for lines in f:
    x = lines[:-1].split(' ')
    x = json.dumps(x)
    x += '\n'
    q.put(x)
    m += 1
    main_bar.update(m)

I've copied the Queue code pretty much straight from the module's documentation.

Before, the whole script would take 2 days to run. Now it is saying 20 days! I'm not quite sure why; could anyone explain this to me?

EDIT: This could be considered a Python Global Interpreter Lock (GIL) problem, but I don't think it is: the work isn't computationally intensive, it's an I/O bottleneck. From the threading docs:

If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

My understanding of this is limited, but I believe this is the latter case, i.e. an I/O-bound task. That was my original reason for reaching for multithreading in the first place: the computation was being blocked by I/O calls that could be moved to a separate thread, letting the computation continue.

FURTHER EDIT: Perhaps the problem is that I'm blocked on I/O from the INPUT, and that is what is slowing it down. Any ideas on how I could effectively move the 'for' loop to a separate thread? Thanks!

  • Have you tried this on a significantly smaller portion of your dataset? What sort of environment are you working on? Commented Sep 18, 2015 at 18:28
  • You've only added the complexity of multiple writers sharing access to a single output file. You might see improvement if each thread were to write to a separate file, although you would probably still want to concatenate the various files in the end. Commented Sep 18, 2015 at 18:28
  • Whatever exactly you do, such fine segmentation probably makes the communication the biggest burden of the application. Why not send chunks of a few thousand lines at a time? And why not write to the file a few thousand lines in a single operation? Commented Sep 18, 2015 at 18:43
  • FWIW factorial(a)/factorial(a-b) = reduce(operator.mul, range(a, a-b, -1)) Commented Sep 18, 2015 at 18:59
  • "Python (at least CPython) has a global interpreter lock that prevents more than one thread from accessing memory at a time" is nonsense. The GIL is released during I/O, and C extension modules such as numpy, regex, and lxml may also release the GIL during long computations in C. The GIL is about Python code; it is not about memory. That said, I don't see any reason multithreading would somehow make hard disks work harder (if they are the bottleneck here). Commented Sep 19, 2015 at 19:51
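The identity in the last-but-one comment is easy to verify (a Python 3 sketch; note that in Python 3, reduce lives in functools and exact division of the factorials needs //):

```python
import math
import operator
from functools import reduce

# n!/(n-k)! is just the product of the top k terms,
# which avoids computing two enormous intermediate factorials
n, k = 124, 5
fast = reduce(operator.mul, range(n, n - k, -1))
slow = math.factorial(n) // math.factorial(n - k)
assert fast == slow == 27018002880  # the "27 billion" lines in the question
```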

2 Answers


If we remove the progressbar code then your code is equivalent to:

#!/usr/bin/env python2
import json
import sys

for line in sys.stdin:
    json.dump(line.split(), sys.stdout)  # split on any whitespace
    print

To improve the time performance, you should measure it first -- take a small input file so that the execution is no more than a minute and run:

$ /usr/bin/time ./your-script < output.txt > json.txt 

I don't know why you think that writing binary blobs from multiple threads to the same file should be any faster.

The candidates for the performance bottleneck here are:

  • the loop overhead (if lines are small and disks are fast). Put the code inside a function; it may improve performance on CPython because global name lookups are replaced with cheaper local ones
  • json.dump() (unlikely but measure it anyway) -- play with parameters such as ensure_ascii=False and measure the results. Try other json modules, different Python implementations.
  • disk I/O -- put the result file on a different physical disk (run something like iotop, csysdig to see how the process consumes resources)
  • unicode <-> bytes conversions, EOL conversions -- open the files in binary mode and encode the JSON text to bytes yourself


If you want to speed this up, don't use Python at all--the task is simple enough to handle with Unix filters, for example:

sed 's/ /", "/g; s/^/["/; s/$/"]/' output.txt > json.txt 

For an explanation of how this works, see here: https://stackoverflow.com/a/14427404/4323

The problem with using Python for this, if you care about speed, is that you're fundamentally reading one line at a time from the input file. Now, there are fancy ways you could do the reading (divide and conquer), but if you just want to speed it up, the above should do the trick.

If you want a progress bar, use pv (the Pipe Viewer, packaged as pv on many Linux systems), which I think would go this way:

pv output.txt | sed 's/ /", "/g; s/^/["/; s/$/"]/' > json.txt 

6 Comments

note: it may produce invalid json.txt if output.txt contains Unicode characters.
It does have unicode characters, what kind of invalidity are we looking at? Also the source file (output.txt) is around 500GB, any issues with this?
500 GB is not a problem for sed, the STream Editor, because it forgets about prior lines as it goes along. As for Unicode, just try it, it may work as it is (I'm not promising though!). Quotes in your source file of course could mess it up too, but that was the case with your Python code as well.
@JohnZwinck: wrong: the Python code escapes quotes and any other characters that can't appear verbatim inside a JSON string. It doesn't matter how fast your command produces wrong results.
@J.F.Sebastian: Oh I see that now, thanks for pointing it out. As for Unicode, I still think my solution may work, but as I said I haven't tested it for that case.
