"Useless" use of 'cat' increases performance. Why?

Question

The files 1..64 are each 160 MBytes and stored in a RAM disk.

Generated by:

seq 120 | parallel -k 'seq {}0000000 {}9999999 | fmt -30' | head -c 10G > 10G parallel --pipepart --block -1 -a 10G -j 64 'cat > {#}'

nocat:

#!/bin/bash export LC_ALL=C sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 1; sort ) < 1) \ <((rm 2; sort ) < 2) ) \ <(sort -m \ <((rm 3; sort ) < 3) \ <((rm 4; sort ) < 4) ) ) \ <(sort -m \ <(sort -m \ <((rm 5; sort ) < 5) \ <((rm 6; sort ) < 6) ) \ <(sort -m \ <((rm 7; sort ) < 7) \ <((rm 8; sort ) < 8) ) ) ) \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 9; sort ) < 9) \ <((rm 10; sort ) < 10) ) \ <(sort -m \ <((rm 11; sort ) < 11) \ <((rm 12; sort ) < 12) ) ) \ <(sort -m \ <(sort -m \ <((rm 13; sort ) < 13) \ <((rm 14; sort ) < 14) ) \ <(sort -m \ <((rm 15; sort ) < 15) \ <((rm 16; sort ) < 16) ) ) ) ) \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 17; sort ) < 17) \ <((rm 18; sort ) < 18) ) \ <(sort -m \ <((rm 19; sort ) < 19) \ <((rm 20; sort ) < 20) ) ) \ <(sort -m \ <(sort -m \ <((rm 21; sort ) < 21) \ <((rm 22; sort ) < 22) ) \ <(sort -m \ <((rm 23; sort ) < 23) \ <((rm 24; sort ) < 24) ) ) ) \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 25; sort ) < 25) \ <((rm 26; sort ) < 26) ) \ <(sort -m \ <((rm 27; sort ) < 27) \ <((rm 28; sort ) < 28) ) ) \ <(sort -m \ <(sort -m \ <((rm 29; sort ) < 29) \ <((rm 30; sort ) < 30) ) \ <(sort -m \ <((rm 31; sort ) < 31) \ <((rm 32; sort ) < 32) ) ) ) ) ) \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 33; sort ) < 33) \ <((rm 34; sort ) < 34) ) \ <(sort -m \ <((rm 35; sort ) < 35) \ <((rm 36; sort ) < 36) ) ) \ <(sort -m \ <(sort -m \ <((rm 37; sort ) < 37) \ <((rm 38; sort ) < 38) ) \ <(sort -m \ <((rm 39; sort ) < 39) \ <((rm 40; sort ) < 40) ) ) ) \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 41; sort ) < 41) \ <((rm 42; sort ) < 42) ) \ <(sort -m \ <((rm 43; sort ) < 43) \ <((rm 44; sort ) < 44) ) ) \ <(sort -m \ <(sort -m \ <((rm 45; sort ) < 45) \ <((rm 46; sort ) < 46) ) \ <(sort -m \ <((rm 47; sort ) < 47) \ <((rm 48; sort ) < 48) ) ) ) ) \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 49; sort ) < 49) \ <((rm 50; sort ) < 50) ) \ <(sort -m \ <((rm 51; sort ) < 51) \ <((rm 52; sort ) < 52) ) ) \ <(sort -m \ <(sort -m \ <((rm 53; sort ) < 53) \ <((rm 54; sort ) < 54) ) \ <(sort -m \ <((rm 55; sort ) < 55) \ <((rm 56; sort ) < 56) ) ) ) \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 57; sort ) < 57) \ <((rm 58; sort ) < 58) ) \ <(sort -m \ <((rm 59; sort ) < 59) \ <((rm 60; sort ) < 60) ) ) \ <(sort -m \ <(sort -m \ <((rm 61; sort ) < 61) \ <((rm 62; sort ) < 62) ) \ <(sort -m \ <((rm 63; sort ) < 63) \ <((rm 64; sort ) < 64) ) ) ) ) ) | md5sum

withcat:

#!/bin/bash export LC_ALL=C sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 1; sort ) < 1) \ <((rm 2; sort ) < 2) | cat) \ <(sort -m \ <((rm 3; sort ) < 3) \ <((rm 4; sort ) < 4) | cat) | cat) \ <(sort -m \ <(sort -m \ <((rm 5; sort ) < 5) \ <((rm 6; sort ) < 6) | cat) \ <(sort -m \ <((rm 7; sort ) < 7) \ <((rm 8; sort ) < 8) | cat) | cat) | cat) \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 9; sort ) < 9) \ <((rm 10; sort ) < 10) | cat) \ <(sort -m \ <((rm 11; sort ) < 11) \ <((rm 12; sort ) < 12) | cat) | cat) \ <(sort -m \ <(sort -m \ <((rm 13; sort ) < 13) \ <((rm 14; sort ) < 14) | cat) \ <(sort -m \ <((rm 15; sort ) < 15) \ <((rm 16; sort ) < 16) | cat) | cat) | cat) | cat) \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 17; sort ) < 17) \ <((rm 18; sort ) < 18) | cat) \ <(sort -m \ <((rm 19; sort ) < 19) \ <((rm 20; sort ) < 20) | cat) | cat) \ <(sort -m \ <(sort -m \ <((rm 21; sort ) < 21) \ <((rm 22; sort ) < 22) | cat) \ <(sort -m \ <((rm 23; sort ) < 23) \ <((rm 24; sort ) < 24) | cat) | cat) | cat) \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 25; sort ) < 25) \ <((rm 26; sort ) < 26) | cat) \ <(sort -m \ <((rm 27; sort ) < 27) \ <((rm 28; sort ) < 28) | cat) | cat) \ <(sort -m \ <(sort -m \ <((rm 29; sort ) < 29) \ <((rm 30; sort ) < 30) | cat) \ <(sort -m \ <((rm 31; sort ) < 31) \ <((rm 32; sort ) < 32) | cat) | cat) | cat) | cat) | cat) \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 33; sort ) < 33) \ <((rm 34; sort ) < 34) | cat) \ <(sort -m \ <((rm 35; sort ) < 35) \ <((rm 36; sort ) < 36) | cat) | cat) \ <(sort -m \ <(sort -m \ <((rm 37; sort ) < 37) \ <((rm 38; sort ) < 38) | cat) \ <(sort -m \ <((rm 39; sort ) < 39) \ <((rm 40; sort ) < 40) | cat) | cat) | cat) \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 41; sort ) < 41) \ <((rm 42; sort ) < 42) | cat) \ <(sort -m \ <((rm 43; sort ) < 43) \ <((rm 44; sort ) < 44) | cat) | cat) \ <(sort -m \ <(sort -m \ <((rm 45; sort ) < 45) \ <((rm 46; sort ) < 46) | cat) \ <(sort -m \ <((rm 47; sort ) < 47) \ <((rm 48; sort ) < 48) | cat) | cat) | cat) | cat) \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 49; sort ) < 49) \ <((rm 50; sort ) < 50) | cat) \ <(sort -m \ <((rm 51; sort ) < 51) \ <((rm 52; sort ) < 52) | cat) | cat) \ <(sort -m \ <(sort -m \ <((rm 53; sort ) < 53) \ <((rm 54; sort ) < 54) | cat) \ <(sort -m \ <((rm 55; sort ) < 55) \ <((rm 56; sort ) < 56) | cat) | cat) | cat) \ <(sort -m \ <(sort -m \ <(sort -m \ <((rm 57; sort ) < 57) \ <((rm 58; sort ) < 58) | cat) \ <(sort -m \ <((rm 59; sort ) < 59) \ <((rm 60; sort ) < 60) | cat) | cat) \ <(sort -m \ <(sort -m \ <((rm 61; sort ) < 61) \ <((rm 62; sort ) < 62) | cat) \ <(sort -m \ <((rm 63; sort ) < 63) \ <((rm 64; sort ) < 64) | cat) | cat) | cat) | cat) | cat) | cat | md5sum

The only difference is that in withcat every sort -m is piped to cat.

The "useless use of cat"-people will make you believe withcat would be slower than nocat. However, the opposite is true by a wide margin:

$ time bash nocat c933d81faea7b8dec8eb64ca0b044d74 - real 3m40.854s user 2m48.687s sys 0m49.135s $ time bash withcat c933d81faea7b8dec8eb64ca0b044d74 - real 2m21.812s user 2m16.651s sys 1m36.135s

The test is run on a 64-core machine, that does nothing else. Everything is in RAM (so this is not due to slow disks). Every test was run 3 times, and the best time is given above. All three tests completed within 5 seconds of the best time (so it is not a fluke).

Why is it faster to pipe the output to cat?

Edit

Does cat group input in bigger chunks? And/or does sort flush output for every line?

To test this I tried:

$ strace -ff sort -m <(sort 1) <(sort 2) 2>fromsort | cat >/dev/null $ strace -ff sort -m <(sort 1 | cat ) <(sort 2 | cat) 2>fromcat | cat >/dev/null

If cat made it into bigger chunks, we would expect read to return larger chunks. But it does not:

$ grep -E 'read|write' fromsort |field 1,5|sort | uniq -c 1 openat(AT_FDCWD, 3 8 pread64(3, = 1 read(3, 3771 40989 read(3, 4096 2 read(3, 832 1 read(3, unknown 1 read(4, 0 1 read(4, 2241 40959 read(4, 4096 1 write(1, 1916 81949 write(1, 4096 $ grep -E 'read|write' fromcat |field 1,5|sort | uniq -c 1 openat(AT_FDCWD, 3 8 pread64(3, = 1 read(3, 3771 40989 read(3, 4096 2 read(3, 832 1 read(3, unknown 1 read(4, 2241 40959 read(4, 4096 1 read(4, unknown 1 write(1, 1916 81949 write(1, 4096

In both cases both read and write are 4K.

(Incidentally, sort does read (much) bigger chunks if reading from a file and not from a pipe, but that is not the case here).

Edit 2

The goal of the above is to demonstrate that an additional cat is not always useless; and to figure out what causes this.

The goal is not to sort data.

But if your goal was to sort data, why not just use sort's built-in --parallel?

By default sort seems to use --parallel 8 on a 64-core machine. top shows it uses up to 800% CPU. You can force it to use 64 cores with --parallel 64:

$ time sort {1..64} | md5sum real 9m4.005s user 29m56.454s sys 5m49.560s $ time sort --parallel 64 {1..64} | md5sum real 6m50.332s user 35m55.040s sys 11m37.609s

So GNU sort's --parallel is much slower than the above. The above is now available as parsort: http://git.savannah.gnu.org/cgit/parallel.git/tree/src/parsort

You would have more luck if you tried with a less <censored> example. For starting sort -m fileA fileB is alternating blocking reads from fileA with those from fileB and adding another buffering layer (which a UUoC is doing, by putting another pipe in between) may cause it to always have what to read from the pipe, and not have to wait for the producer to put it there. But that is simply a fluke; the shells process substs and general purpose tools like sort are not appropriate for taking advantage of multiprocessing machines (even of my "4-core" phone, much less of a 64-core monster) — user313992
– user313992, Commented Oct 13, 2020 at 6:06
Aside from the cat issue, according to the docs, at least GNU sort should run parallel processes by default, and there's an option to change the number of processes it uses. Doesn't that built-in parallelising help? Or in other words, does it suck so bad that it's useful to do the parallelization by hand? — ilkkachu
– ilkkachu, Commented Oct 13, 2020 at 11:05
@ilkkachu The built-in parallelization does not scale to 64 cores (I think it scales to 4 or 8 cores), and it does not work at all when doing sort -m. The inner sorts could use the built-in parallelization, but they only run at 100% for ~15 secs. So: Yes, it sucks so bad. — Ole Tange
– Ole Tange, Commented Oct 13, 2020 at 16:58
Could you include enough instructions for someone else to replicate the results you're seeing? You say that the files are 160MB each, but don't indicate what's in them... presumably it's some amount of text, divided into sortable lines? — JigglyNaga
– JigglyNaga, Commented Oct 13, 2020 at 18:47
"sort, delete, copy" is more complicated than "sort". Yes, temp files should be removed when they're no longer needed -- but in this situation, you're investigating performance problems, so they're still needed for the next run. Why include extra, unnecessary steps, instead of showing a Minimal Working Example? — JigglyNaga
– JigglyNaga, Commented Oct 13, 2020 at 21:56

Philip Couling · Accepted Answer · 2025-03-31 08:26:41Z

This is by no means a "useless use of cat".

some_command | cat | some_command

This isn't a traditional "useless use of cat" which are usually derived from ignorance of the shell. Instead this appears to be a deliberate attempt to do something using the dynamics of cat. In this case I believe it's caching.

My Second thoughts

Even if the size of read and write isn't any different there are a couple of things which might be undetectable which could also be in play.

Firstly (and this is very important): Why is processing a sorted array faster than processing an unsorted array?. If you do anything to change the sequence in which the CPU is processing this, the timing may change. If cat succeeds in making each sort run for longer without suspending (and switching to a different process) then this could dramatically affect the CPU's branch prediction and result in a much larger or smaller time.

Secondly, even if the number and size of read is left unaffected number of times a task has to suspend (block) may be different. This in itself is likely to add or remove an overhead. So even if the reads and writes are of the same size, the cat (caching) layer might be reducing the number of times each read() and write() causes a process switch.

Cat might simply be forcing sort to wait longer and thus have more available to do without suspending and reducing the number of times each process blocks. This would be very difficult to detect.

My First thoughts

My expectation here would be that if you put both versions in their own script and ran strace -f on each script, you would see fewer read or / write calls in the example with cat. At least, I would expect to see much larger reads at each layer using cat. My expectation for sort would be that it writes single lines and doesn't internally buffer much. Indeed I would expect it to read() in large enough blocks but only write() in single lines. This means it's not well designed for piping to itself.

As laktak points out in his answer, cat reads in blocks of 128KB (see here) but pipes typically only buffer 64KB. If I'm right then when cat is suspended waiting for a read() to complete this will give a large (128 + 64 KB) buffer for the writing sort operation to write into without ever needing to suspend. By the time cat is resumed there will be a good chunk of data (much more than sort sent in a single write) to pass onto the next sort. As a result the next sort can read from this quite a lot without being suspended.

I also suspect that adding a layer of cat closest the files would have next to no impact or negative impact on performance. These files are already cached on your ram disk. But the layers in-between calls to sort will be acting as a buffer and should reduce the number. That is the truly "useless uses of cat" are those which use cat to read from a file. That's those of the form:

cat some_file | some_command

An interesting experiment

I would be interested to know if the same effect can be induced by increasing the buffer size on the pipes. If you setup the same pipeline from a proper programming language (not a shell). Eg in C you could create your pipeline using pipe(),dup2(),fork(),exec() and call ioctl() on each pipe first to raise the buffer size (see Pipe Capacity)

laktak · Accepted Answer · 2020-10-12 20:47:55Z

1

My guess is that your use of cat throttles the throughput of each individual command which in turn lets them run faster in parallel.

cat reads your data in chunks of 128KB. Since I have no way to reproduce your test could you try to replace your usage of cat with dd to prove me right or wrong?

dd status=none bs=128K should have the same effect as cat - try to increase/decrease the block size and compare the results.

answered Oct 12, 2020 at 20:47

laktak

6,37332 silver badges47 bronze badges

Why would it make them run faster in parallel to be throttled? Limiting throughput usually results in slower results. dd gives comparable results to cat. Slightly faster than cat with bs=2K: 2m13s.

Ole Tange
– Ole Tange

2020-10-12 21:48:42 +00:00
Commented Oct 12, 2020 at 21:48
Because then one process can't hog all the resouces - e.g. the other sorts would be waiting for input and doing nothing while one process would saturate the io.

laktak
– laktak

2020-10-12 21:57:53 +00:00
Commented Oct 12, 2020 at 21:57
What I/O are you talking about: This is a RAM disk. I can copy the 64 files (=10 GB) in less than 2 seconds if done in parallel. Also the inner sorts will be done sorting before they start sending data to the 63 remaining sorts.

Ole Tange
– Ole Tange

2020-10-12 22:09:07 +00:00
Commented Oct 12, 2020 at 22:09
When the inner sorts are done after around 15 secs, the CPU utilization drops to around 20%: 1 process at 100%, 2 at 50%, 4 at 25% and so on.

Ole Tange
– Ole Tange

2020-10-12 23:29:30 +00:00
Commented Oct 12, 2020 at 23:29
Yes, that's what I meant. The distribution of the resources between the processes isn't fair which is preventing full cpu utilization.

laktak
– laktak

2020-10-13 13:49:59 +00:00
Commented Oct 13, 2020 at 13:49

| Show 1 more comment

Stack Exchange Network

"Useless" use of 'cat' increases performance. Why?

2 Answers 2

You must log in to answer this question.

Linked

Hot Network Questions

"Useless" use of 'cat' increases performance. Why?

2 Answers 2

You must log in to answer this question.

Linked

Related

Hot Network Questions