paste is a brilliant tool, but it is dead slow: I get around 50 MB/s on my server when running:
    paste -d, file1 file2 ... file10000 | pv >/dev/null

paste is using 100% CPU according to top, so it is not limited by, say, a slow disk.
Looking at the source code, it is probably slow because it uses getc():
    while (chr != EOF)
      {
        sometodo = true;
        if (chr == line_delim)
          break;
        xputchar (chr);
        chr = getc (fileptr[i]);
        err = errno;
      }

Is there another tool that does the same, but which is faster? Maybe by reading 4k-64k blocks at a time? Maybe by using vector instructions to find the newlines in parallel instead of looking at a single byte at a time? Maybe using awk or similar?
The input files are UTF8 and so big they do not fit in RAM, so reading everything into memory is not an option.
Edit:
thanasisp suggests running jobs in parallel. That improves throughput slightly, but it is still an order of magnitude slower than pure pv:
    # Baseline
    $ pv file* | head -c 10G >/dev/null
    10.0GiB 0:00:11 [ 897MiB/s] [> ] 3%

    # Paste all files at once
    $ paste -d, file* | pv | head -c 1G >/dev/null
    1.00GiB 0:00:21 [48.5MiB/s] [ <=> ]

    # Paste 11% at a time in parallel, and finally paste these
    $ paste -d, <(paste -d, file1*) <(paste -d, file2*) <(paste -d, file3*) \
        <(paste -d, file4*) <(paste -d, file5*) <(paste -d, file6*) \
        <(paste -d, file7*) <(paste -d, file8*) <(paste -d, file9*) |
      pv | head -c 1G > /dev/null
    1.00GiB 0:00:14 [69.2MiB/s] [ <=> ]

top still shows that the outer paste is the bottleneck.
I tested if increasing the buffer made a difference:
    $ stdbuf -i8191 -o8191 paste -d, <(paste -d, file1?) <(paste -d, file2?) \
        <(paste -d, file3?) <(paste -d, file4?) <(paste -d, file5?) \
        <(paste -d, file6?) <(paste -d, file7?) <(paste -d, file8?) \
        <(paste -d, file9?) | pv | head -c 1G > /dev/null
    1.00GiB 0:00:12 [80.8MiB/s] [ <=> ]

This increased throughput by 10%. Increasing the buffer further gave no improvement. This is likely hardware dependent (i.e. it may be due to the size of the level 1 CPU cache).
Tests are run on a RAM disk to avoid limitations related to the disk subsystem.

Comments:

- getc() doesn't necessarily cause data to be read from disk if the data has already been read into a buffer by the I/O library (which may well read 4 KB chunks).
- awk is terribly slow compared to paste itself: even if you define a custom RS that never exists in your file, in order to tell awk the whole file is a single record, awk will still scan the whole file from top to bottom.
- paste is generic, and it does need to check for the end-of-line character (it does not read one character at a time from disk, though; there are underlying buffers). Whatever optimization you can do will probably be related to the specific formatting/content of the input files. Could you share more info on them? Maybe an example?