
Which is faster at processing a 1 TB file: a single machine or 5 networked machines? ("Processing" here means finding the single UTF-16 character with the most occurrences in that 1 TB file.) The network transfer rate is 1 Gbit/s, the entire 1 TB file initially resides on one computer, and each computer has a quad-core CPU.

Below is my attempt at the question, using an array of longs (of size 2^16) to keep track of the character counts. This should fit into the memory of a single machine, since 2^16 x 2^3 bytes (the size of a long) = 2^19 bytes = 0.5 MB. Any help (links, comments, suggestions) would be much appreciated. I used the latency numbers cited by Jeff Dean and tried to use the best approximations I knew of. My final answers are:

Single Machine: 5.8 hrs (due to slowness of reading from disk)
5 Networked Machines: 7.64 hrs (due to reading from disk and network)

1) Single Machine

a) Time to read the file from disk --> 5.8 hrs
If it takes 20 ms to read 1 MB sequentially from disk, then reading 1 TB from disk takes:
20 ms/MB x 1024 MB/GB x 1024 GB/TB = 20,972 s = 350 min = 5.8 hrs

b) Time needed to fill the array with complete count data --> 0 s, since it is computed while doing step 1a
At 0.5 MB, the count array fits into L2 cache. Since L2 cache takes only 7 ns to access, the CPU can read and write the count array while waiting on the disk read.

c) Time to iterate through the entire array to find the max count --> 0.00625 ms
Since it takes 0.0125 ms to read and write 1 MB of L2 cache and the array size is 0.5 MB, the time to iterate through the array is:
0.0125 ms/MB x 0.5 MB = 0.00625 ms

d) Total time
Total = a + b + c =~ 5.8 hrs (due to the slowness of reading from disk)

2) 5 Networked Machines

a) Time to read and transfer the data over 1 Gbit/s --> 6.48 hrs
Transferring 1 TB over 1 Gbit/s: 1 TB x 1024 GB/TB x 8 bits/B x 1 s/Gbit = 8,192 s = 137 min = 2.3 hrs
But since the original machine keeps a fifth of the data, it only needs to send 4/5 of it, so the transfer time is:
2.3 hrs x 4/5 = 1.84 hrs
But to send the data, it first has to be read from disk, which takes (4/5)(answer 1a) = (4/5)(5.8 hrs) = 4.64 hrs
So the total time = 1.84 hrs + 4.64 hrs = 6.48 hrs

b) Time to fill the array with count data on the original machine --> 1.16 hrs
The original machine (the one that holds the 1 TB file) still needs to read its remaining fifth of the data to fill the array with count data, which takes (1/5)(answer 1a) = 1.16 hrs. The CPU time to read and write the array is negligible, as shown in 1b.

c) Time to fill the other machines' arrays with counts --> not counted
The count arrays can be computed while the file is being transferred, so this time is not counted.

d) Time required to receive the 4 arrays --> 2^-6 s
Each count array is 0.5 MB:
0.5 MB x 4 arrays x 8 bits/B x 1 s/Gbit = (2^20/2) B x 2^2 x 2^3 bits/B x 1 s/2^30 bits = 2^25/2^31 s = 2^-6 s

e) Time to merge the arrays --> 0 s, since they can be merged while being received

f) Total time
Total = a + b + c + d + e =~ a + b =~ 6.48 hrs + 1.16 hrs = 7.64 hrs
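For concreteness, here is a minimal sketch in Python of the counting and merging steps I have in mind (just to show the shape of the computation; the file path and chunk size below are made up, and it treats each 16-bit code unit as a "character", which is what the 2^16-entry array implies):

```python
from array import array

CHUNK_BYTES = 1 << 20  # read 1 MiB at a time (arbitrary choice)

def count_code_units(path):
    """Stream the file and count occurrences of each 16-bit code unit."""
    counts = array('Q', [0]) * (1 << 16)  # 2^16 longs = 0.5 MB, fits in L2
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(CHUNK_BYTES)
            if not chunk:
                break
            # View the bytes as unsigned 16-bit code units (assumes the file
            # and chunk lengths are multiples of 2, i.e. well-formed UTF-16).
            for unit in memoryview(chunk).cast('H'):
                counts[unit] += 1
    return counts

def merge(count_arrays):
    """Element-wise sum of the per-machine count arrays (steps 2d/2e above)."""
    total = array('Q', [0]) * (1 << 16)
    for counts in count_arrays:
        for i, c in enumerate(counts):
            total[i] += c
    return total

def most_frequent(counts):
    """Code unit value with the highest count (step 1c above)."""
    return max(range(len(counts)), key=counts.__getitem__)

# Hypothetical usage on a single machine:
# counts = count_code_units('/data/bigfile')
# print(hex(most_frequent(counts)))
```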
  • Not really a programming question... Commented Jul 14, 2012 at 17:59
  • Pretty sure you were not supposed to calculate it. There are 3 subsystems at work here (disk, network, CPU) that operate independently and can overlap their work, which makes the job dominated by the slowest part: the disk. Neither way is fastest. Commented Jul 14, 2012 at 18:54
  • How are you calculating 4 GB for the size of your count array? You need (2^16)*sizeof(uint64) bytes, that is to say 2^16 * 8 == 2^19 bytes == 0.5 MB. Commented Jul 14, 2012 at 20:46
  • Your throughput numbers are wrong. On a 1 Gb network you get around 100 MB per second, not more, and even that only if your network is finely tuned (unless you're using some crazy InfiniBand or something). Commented Jul 16, 2012 at 7:11
  • You still get it wrong. Point (1.b) should read 0.5 MiB, not 0.5 GiB. Point (2.a): it takes 5.8 hrs to read the data AND 1.84 hrs to send it (the node that holds the data does not have to send to itself, hence 1.84 hrs instead of 2.3 hrs), or 7.64 hrs in total! Commented Jul 16, 2012 at 9:01

2 Answers


This is not an answer but just a longer comment. You have miscalculated the size of the frequency array. A 1 TiB file contains 550 G symbols and, because nothing is said about their expected frequency, you would need a count array of at least 64-bit integers (that is, 8 bytes per element). The total size of this frequency array would be 2^16 * 8 = 2^19 bytes, or just 512 KiB, and not 4 GiB as you have miscalculated. It would only take ≈4.3 ms to send this data over a 1 Gbps link (protocol headers take roughly 3% if you use TCP/IP over Ethernet with an MTU of 1500 bytes; less with jumbo frames, but those are not widely supported). This array also fits comfortably in the CPU cache.
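A quick back-of-envelope check of those two numbers, as a sketch (the 3% figure is the rough protocol-overhead estimate mentioned above):

```python
array_bytes = (1 << 16) * 8           # 2^16 counters x 8 bytes each
print(array_bytes // 1024, "KiB")     # 512 KiB

link_bps = 1e9                        # 1 Gbit/s
overhead = 1.03                       # ~3% TCP/IP + Ethernet headers at MTU 1500
transfer_ms = array_bytes * 8 * overhead / link_bps * 1000
print(round(transfer_ms, 1), "ms")    # ~4.3 ms
```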

You have grossly overestimated the time it would take to process the data and extract the frequencies, and you have also overlooked the fact that this processing can overlap the disk reads. In fact, updating the frequency array, which resides in the CPU cache, is so fast that the computation time is negligible: most of it will overlap the slow disk reads. But you have underestimated the time it takes to read the data. Even with a multi-core CPU you still have only one path to the hard drive, and hence you would still need the full 5.8 hrs to read the data in the single-machine case.
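To make the overlap concrete, here is a hypothetical sketch: one thread performs the slow sequential reads while the main thread updates the in-cache count array, so the counting cost hides behind the disk I/O (chunk size and queue depth are arbitrary):

```python
import queue
import threading
from array import array

CHUNK_BYTES = 1 << 20  # 1 MiB reads (arbitrary)

def reader(path, q):
    """Producer: stream chunks from disk; an empty chunk marks end of file."""
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(CHUNK_BYTES)
            q.put(chunk)
            if not chunk:
                break

def count_overlapped(path):
    """Consumer: update the 0.5 MB count array while the reader waits on disk."""
    counts = array('Q', [0]) * (1 << 16)
    q = queue.Queue(maxsize=8)  # small buffer between I/O and compute
    t = threading.Thread(target=reader, args=(path, q))
    t.start()
    while True:
        chunk = q.get()
        if not chunk:
            break
        for unit in memoryview(chunk).cast('H'):
            counts[unit] += 1
    t.join()
    return counts
```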

In fact, this is an example of the kind of data processing that benefits neither from parallel networked processing nor from having more than one CPU core. This is why supercomputers and other fast networked processing systems use distributed parallel file storage that can deliver many GB/s of aggregate read/write bandwidth.


1 Comment

Thanks @Hristo, especially your point about CPU computation overlapping with the slow disk reads. I've updated the question w/the new estimates, and the numbers make much more sense.

You only need to send 0.8 TB if your source machine is part of the 5.

It may not even make sense sending the data to other machines. Consider this:

In order for the source machine to send the data, it must first hit the disk to read the data into main memory before sending it over the network. If the data is already in main memory and not being processed, you are wasting that opportunity.

So, under the assumption that loading data into the CPU cache is much less expensive than reading from disk into memory or sending data over the network (which is true, unless you are dealing with alien hardware), you are better off just doing it on the source machine. The only case where splitting up the task makes sense is if the "file" is somehow created/populated in a distributed way to start with.

So you should only count the disk read time of a 1 TB file, plus a tiny bit of overhead for L1/L2 cache and CPU ops. The cache access pattern is optimal since the reads are sequential, so you only take one cache miss per piece of data.

The main point here is that the disk is the primary bottleneck, and it overshadows everything else.
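As a rough sanity check of that claim, a sketch using the question's 20 ms/MB disk figure and the stated 1 Gbit/s link (real hardware will vary):

```python
disk_mb_per_s = 1 / 0.020       # 20 ms per MB  -> 50 MB/s sequential read
net_mb_per_s = 1e9 / 8 / 1e6    # 1 Gbit/s      -> 125 MB/s raw, before overhead
print(round(net_mb_per_s), "MB/s over the network vs", round(disk_mb_per_s), "MB/s from disk")

file_mb = 1024 * 1024           # 1 TB
read_hours = file_mb / disk_mb_per_s / 3600
print(round(read_hours, 1), "hours just to read the file")  # ~5.8

# You have to pay the full disk read either way, and counting is essentially
# free, so shipping the data to other machines cannot help.
```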

