
This is my first experience with the Julia language, and I'm quite surprised by its simplicity.

I need to process big files, where each line is a set of tab-separated strings. As a first example, I started with a simple counting program; I managed to use @parallel with the following code:

    d = open(f)
    lis = readlines(d)
    ntrue = @parallel (+) for li in lis
        contains(li, s)
    end
    println(ntrue)
    close(d)

I compared the parallel approach against a simple serial one on a 3.5 GB file (more than 1 million lines). On a 4-core Intel Xeon E5-1620 at 3.60 GHz, with 32 GB of RAM, I got:

Parallel = 10.5 seconds; Serial = 12.3 seconds; Allocated Memory = 5.2 GB

My first concern is memory allocation: is there a better way to read the file incrementally, lowering the memory allocation while preserving the benefits of parallelizing the processing? Secondly, since the CPU gain from @parallel is not astonishing, I wonder whether it is due to this specific workload or to my naive use of Julia's parallel features. In the latter case, what would be the right approach to follow? Thanks for the help!


1 Answer


Your program reads the entire file into memory at once as a large array of strings. You may want to try a serial version that processes the lines one at a time instead (i.e. streaming):

    const s = "needle" # it's important for this to be const
    open(f) do d
        ntrue = 0
        for li in eachline(d)
            ntrue += contains(li, s)
        end
        println(ntrue)
    end

This avoids allocating an array to hold all of the strings and avoids allocating all of the string objects at once, allowing the program to reuse the same memory by periodically reclaiming it during garbage collection. Try this and see whether it improves the performance sufficiently for you. The fact that s is const is important, since it allows the compiler to predict the types in the for loop body, which isn't possible if s could change value (and thus type) at any time.

If you still want to process the file in parallel, you will have to open the file in each worker and advance each worker's read cursor (using the seek function) to an appropriate point in the file to start reading lines. Note that you'll have to be careful to avoid reading in the middle of a line and you'll have to make sure each worker does all of the lines assigned to it and no more – otherwise you might miss some instances of the search string or double count some of them.
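As a rough sketch of that chunking scheme, written in the same pre-1.0 dialect as the question (@parallel, contains): the names count_chunk and parallel_count are hypothetical, and the boundary convention assumed here is that each line belongs to the chunk that contains its first byte.

```julia
# Sketch: count matching lines in parallel by splitting the file into
# byte ranges, one per chunk. A line is owned by the chunk containing
# its first byte. Assumes workers were started with addprocs(...).

@everywhere function count_chunk(f, s, from, to)
    n = 0
    open(f) do io
        if from > 0
            seek(io, from - 1)
            readline(io)          # skip the line straddling the boundary;
                                  # it is owned by the previous chunk
        end
        # read every line that starts inside [from, to], even if it
        # ends past `to` -- that is how we avoid missing or double
        # counting lines at chunk boundaries
        while !eof(io) && position(io) <= to
            n += contains(readline(io), s)
        end
    end
    return n
end

function parallel_count(f, s, nchunks)
    sz = filesize(f)
    # chunk i covers bytes bounds[i] .. bounds[i+1]-1
    bounds = [div(sz * i, nchunks) for i in 0:nchunks]
    @parallel (+) for i in 1:nchunks
        count_chunk(f, s, bounds[i], bounds[i+1] - 1)
    end
end
```

With no workers added, @parallel simply runs the chunks locally, so the boundary logic can be tested serially before scaling out.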

If this workload isn't just an example and you actually want to count the number of lines in a file in which a certain string occurs, you may simply want to use the grep command, e.g. calling it from Julia like this:

    julia> s = "boo"
    "boo"

    julia> f = "/usr/share/dict/words"
    "/usr/share/dict/words"

    julia> parse(Int, readchomp(`grep -c -F $s $f`))
    292

Since the grep command has been carefully optimized over decades to search text files for lines matching certain patterns, it's hard to beat its performance. [Note: if it's possible that zero lines contain the pattern you're looking for, you will want to wrap the grep command in a call to the ignorestatus function since the grep command returns an error status code when there are no matches.]
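A minimal sketch of the ignorestatus variant, using a throwaway temporary file rather than a real dictionary so the example is self-contained:

```julia
# Build a small file with no occurrences of the search string.
path = tempname()
open(path, "w") do io
    write(io, "foo\nbar\nbaz\n")
end

# grep exits with status 1 when no lines match, which Julia would
# otherwise raise as an error; ignorestatus suppresses that, while
# `grep -c` still prints the count (here 0) on stdout.
n = parse(Int, readchomp(ignorestatus(`grep -c -F needle $path`)))
println(n)  # 0 -- no matches, and no error thrown

rm(path)
```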


2 Comments

Thanks for the hint. The single-processor version with the "const" trick, where eachline(f) should be changed to eachline(d), does better than both the @parallel version and grep: 8.5 seconds. Nevertheless, I will need to do more complex stuff, for which I'd rather use parallelization. Is there a reference I can start from? Isn't @parallel supposed to split the array of strings between the workers so they can do the processing independently? Adding addprocs(3) makes memory usage increase a lot (I have to stop the computation to avoid running out of memory).
I'm surprised (but pleased) that the efficient serial Julia version is faster than grep. Parallel processing adds overhead on each processor, and you won't save on total memory usage since the data has to be loaded somewhere. If you run across multiple machines you can access more total memory, but on a single system parallelism is not a way to reduce memory usage.
