
paste is a brilliant tool, but it is dead slow: I get around 50 MB/s on my server when running:

paste -d, file1 file2 ... file10000 | pv >/dev/null 

paste is using 100% CPU according to top, so it is not limited by, say, a slow disk.

Looking at the source code, it is probably because it uses getc:

while (chr != EOF)
  {
    sometodo = true;
    if (chr == line_delim)
      break;
    xputchar (chr);
    chr = getc (fileptr[i]);
    err = errno;
  }

Is there another tool that does the same, but which is faster? Maybe by reading 4k-64k blocks at a time? Maybe by using vector instructions for finding the newline in parallel instead of looking at a single byte at a time? Maybe using awk or similar?

The input files are UTF-8 and so big they do not fit in RAM, so reading everything into memory is not an option.
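To illustrate what I mean by reading blocks and searching for newlines with vectorized code, here is a rough C sketch. It is only an illustration of the mechanism, not a paste replacement: it merely counts the lines of a single file, and it assumes glibc's memchr as the vectorized newline search.

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror(argv[1]); return 1; }

    char buf[64 * 1024];          /* read 64k blocks instead of one getc() per byte */
    size_t n, lines = 0;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        const char *p = buf, *end = buf + n, *nl;
        /* memchr scans many bytes per instruction in typical libc builds */
        while ((nl = memchr(p, '\n', end - p)) != NULL) {
            lines++;
            p = nl + 1;
        }
    }
    printf("%zu lines\n", lines);
    fclose(fp);
    return 0;
}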

Edit:

thanasisp suggests running jobs in parallel. That improves throughput slightly, but it is still an order of magnitude slower than pure pv:

# Baseline
$ pv file* | head -c 10G >/dev/null
10.0GiB 0:00:11 [ 897MiB/s] [> ] 3%

# Paste all files at once
$ paste -d, file* | pv | head -c 1G >/dev/null
1.00GiB 0:00:21 [48.5MiB/s] [ <=> ]

# Paste 11% at a time in parallel, and finally paste these
$ paste -d, <(paste -d, file1*) <(paste -d, file2*) <(paste -d, file3*) \
    <(paste -d, file4*) <(paste -d, file5*) <(paste -d, file6*) \
    <(paste -d, file7*) <(paste -d, file8*) <(paste -d, file9*) | pv | head -c 1G > /dev/null
1.00GiB 0:00:14 [69.2MiB/s] [ <=> ]

top still shows that it is the outer paste that is the bottleneck.

I tested if increasing the buffer made a difference:

$ stdbuf -i8191 -o8191 paste -d, <(paste -d, file1?) <(paste -d, file2?) \
    <(paste -d, file3?) <(paste -d, file4?) <(paste -d, file5?) \
    <(paste -d, file6?) <(paste -d, file7?) <(paste -d, file8?) \
    <(paste -d, file9?) | pv | head -c 1G > /dev/null
1.00GiB 0:00:12 [80.8MiB/s] [ <=> ]

This increased throughput by 10%. Increasing the buffer further gave no improvement; this is likely hardware dependent (i.e. it may be due to the size of the level 1 CPU cache).

Tests are run on a RAM disk to avoid limitations related to the disk subsystem.

  • If I'm not mistaken, getc() doesn't necessarily cause data to be read from disk if the data already has been read into a buffer by the I/O library (which may well read 4 KB chunks). Commented Nov 25, 2020 at 13:04
  • Don't count on awk; that's terribly slower than paste itself. Even if you define a custom RS that never exists in your file in order to tell awk the whole file is a single record, awk will still scan the whole file from top to bottom. Commented Nov 26, 2020 at 11:21
  • paste is generic and it does need to check for the end-of-line character (it does not read one character at a time from disk though; there are underlying buffers). Whatever optimization you can do will probably be related to the specific formatting/content of the input files. Could you share more info on them? Maybe an example? Commented Nov 28, 2020 at 20:53
  • @EduardoTrápani The content is UTF-8 text files. No other restrictions. Commented Nov 29, 2020 at 1:12
  • I don't think this could be done faster by one process keeping 10K files open and reading them. Also, the standard text-processing tools (awk/sed etc.) are expected to be slower; they could be faster only for a few files, loading all lines into memory, whereas paste is fast using no memory. Just an idea that you maybe already have: some combination of paste commands run in parallel, pasting incrementally, could be a bit faster; it depends on the box maybe (like: paste 100 streams of paste commands on 100 files each). Commented Dec 1, 2020 at 9:47

3 Answers

3

Update in Jan 2025:

  • Retested on the same machine after a major OS update
  • python bumped from 3.9 to 3.12
  • added AWK and Perl versions
  • NIM version compiled with 1.6.2 and 2.2.0 for comparison

tl;dr:

  1. yes, coreutils paste is far slower than cat
  2. there seems to be no easily available alternative that is uniformly faster than coreutils paste, in particular not for lots of short lines and many files.
  3. paste is amazingly stable in throughput across different combinations of line length, number of lines and number of files
  4. faster alternatives for longer lines are provided below ("ancient" Perl and AWK once again prove their value)

In Detail:

I tested quite a number of scenarios. Throughput measurement was done using pv as in the original post.

Compared Programs:

  1. cat (from GNU coreutils 9.1, previously 8.25; used as the baseline and not part of the competition)
  2. paste (also from GNU coreutils 9.1, previously 8.25)
  3. python script from the answer below
  4. alternative python script (replacing the list comprehension for collecting line fragments with a regular loop; called python2.py below)
  5. nim program (akin to 4., but a compiled executable; paste.nim below is compiled with nim v1.6.2, paste2.nim with nim v2.2, both using the -d:release flag)
  6. Perl version running on Perl 5.34.0
  7. Two AWK versions, one printing every part as soon as it is available, the other first combining all parts into one line and then printing it as a whole. Both were run on GNU awk 5.1.0

File / Line number combinations:

#   columns       lines
1   200,000       1,000
2    20,000       10,000
3     2,000       100,000
4       200       1,000,000
5        20       10,000,000
6         2       100,000,000

The total amount of data was the same in each test (1.3 GB). Each column consisted of 6-digit numbers (e.g. 000'001 to 200'000). The above combinations were distributed across 1, 10, 100, 1'000, and 10'000 equally sized files as far as possible.

Files were generated like: yes {000001..200000} | head -1000 > 1

Pasting was done like: for i in cat paste ./paste ./paste2 ./paste3; do $i {00001..1000} | pv > /dev/null; done

However, the files pasted were actually all links to the same original file, so all data should be in cache anyway (created directly before pasting and read with cat first; system memory is 128GB, cache size 34GB)

An additional set was run, where data were created on the fly instead of being read from pre-created files, and piped into paste (denoted below with number of files=0).

For the last set the command was like: for i in cat paste ./paste ./paste2 ./paste3; do $i <(yes {000001..200000} | head -1000) | pv > /dev/null; done

Findings:

  1. paste is an order of magnitude slower than cat
  2. paste's throughput is extremely consistent (~300MB/s) across a wide range of line widths and numbers of files involved.
  3. Home grown python alternatives can show some advantage as soon as average input file line length is above a certain limit (~200 characters/line on my test machine).
  4. No need for nim any more: Python 3.12 delivers an amazing speed-up compared to 3.9 and is on par with nim 1.6; nim v2.2 produces even slower code than nim 1.6 (at least w.r.t. this specific task). Previously, the compiled nim version had about double the throughput of the python scripts. The break-even point in comparison with paste is ~500 characters per line for one input file; this decreases with a growing number of input files, down to ~150 characters per input file line as soon as at least 10 input files are involved.
  5. All presented alternatives suffer from processing overhead for many short lines (suspected reasons: a) the stdlib functions used try to detect line endings and convert them to platform-specific endings; b) looping is done in the high-level languages). coreutils paste, however, is not affected.
  6. Seemingly the simultaneous on-the-fly data generation process was the limiting factor for cat, as well as for the other programs with longer lines, and also affected processing speed to some extent.
  7. At some point the multitude of open file handles seems to have a detrimental impact even on coreutils paste. (Just speculating: Maybe this could even affect the parallel version?)
  8. One does not want to use the AWK version for scenarios involving very many very narrow files. Performance is unbearable.
  9. With AWK, combining all the pieces for every output line in memory before printing, instead of printing every single piece separately, is important, even though AWK lacks a convenient function to join a list of strings.
  10. For situations with a reasonable number of files and somewhat longer lines, the two ancient string-processing tools AWK and Perl (in particular the latter) completely blew every other competitor out of the water, reducing runtime to as low as 30% of that of coreutils paste.

(Figure: measured throughput of each program for the tested combinations of file count and line length.)

Conclusion (at least for test machine)

  1. If input files are narrow, use coreutils paste, in particular when files are very long.

  2. If input files are rather wide, prefer an alternative (input file line length > 1400 characters for the python versions, 150-500 characters for the nim version, depending on the number of input files).

  3. Generally prefer Perl over everything else if lines are not too short (previously: the compiled nim version over the python scripts). The AWK version running on GNU awk is a close second in some of the situations.

  4. Beware of too many small fragments. The default soft limit of 1024 open files for a process seems quite reasonable in this context.

Suggestion for OP's situation (parallel processing)

If input files are very narrow and many, use coreutils paste for the inner jobs and the Perl or AWK version for the outermost process. If all files have long lines, use the Perl or AWK versions generally.

Caveat: the linked programs are ad-hoc versions, provided as they are, without any guarantees and without explicit error handling. Also, the separator is hard-coded in all implementations.

I have yet to try newer compiled alternatives like the Rust rewrite of the coreutils.

1

Despite getc being highly optimized and nowadays mostly implemented as a buffer-backed macro, I'd agree that it may still be the bottleneck, just as you suspect; or rather, the possibly comparably small size of the buffer used, which still results in a high number of file reads.

While I did not get the exact numbers you showed above, in my tests there is still a marked difference between baseline and paste runs.

(Maybe the culprit is the switch in buffering (block buffering for the files, line buffering for the output stream). However, I'm not that experienced in that regard.)

Testing further I got a similar drop in throughput when using the following construct:

cat file* | dd | pv > /dev/null 

Throughput with dd inserted in between was very similar to that of the paste run (dd by default uses a block size of 512 bytes). Reducing the block size further increases the running time proportionally. When increasing the block size to just a few kB (e.g. 8 or 16), however, speed increased dramatically, and with 1 or 2M it took off:

cat file* | dd bs=2M | pv > /dev/null 

There seems to be a way to change the buffer size used by getc (and, in the case of stdout, to switch from line to block buffering with a freely selected size). However, one has to keep in mind that with several thousands of open files and a buffer for each of them, the memory requirement goes up quickly.

Nonetheless, one could try to change the buffering (see e.g. https://stackoverflow.com/questions/66219179/usage-of-getc-with-a-file) by using an appropriate setvbuf call and see what will happen.
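A minimal sketch of what such a setvbuf experiment could look like (illustrative only; the 64 kB buffer size and the surrounding scaffolding are assumptions, not the actual coreutils code):

#include <stdio.h>

#define INBUF (64 * 1024)   /* assumed per-file buffer size; tune as needed */

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        FILE *fp = fopen(argv[i], "r");
        if (!fp) { perror(argv[i]); return 1; }
        /* setvbuf must be called after fopen and before the first read;
           NULL lets stdio allocate the buffer, _IOFBF = fully buffered */
        if (setvbuf(fp, NULL, _IOFBF, INBUF) != 0)
            perror("setvbuf");
        /* ... a getc()-based pasting loop would run here ... */
        fclose(fp);
    }
    return 0;
}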

Addition: Presently I'm not aware of a comparable, publicly available program that is faster for the same task.

(P.S.: Just read your name - aren't you the GNU parallel guy? Great piece of work!)

1

As I have experienced python text processing as comparably fast in the past (i.e. the python text processing routines are highly optimized and heavily tuned), I just cobbled together a small quick-and-dirty python script and compared it with paste from coreutils. (Of course this thing is very limited w.r.t. its options, as it is only for demonstration purposes (it only accepts names of existing files), and the column separator is hard-wired.)

Put this into a text file called paste in your current directory, make it executable and give it a try.

#! /usr/bin/env python3
import sys
filenames = sys.argv[1:]
infiles = [open(i,'r') for i in filenames]
while True:
    lines = [i.readline() for i in infiles]
    if all([i=='' for i in lines]):
        break
    print("\t".join([i.strip("\n") for i in lines]))

Update: Fixed a bug in the above script that caused it to abort as soon as the same line was found empty in all input files. Timing measurement for the python version also updated below to reflect actual run time.

On several systems I tested, after warming up the cache (I did not bother creating a ramdisk), the python version above consistently outperformed coreutils paste by a margin of 10% (Debian Buster native) to 30% (VM running Debian Stretch). On Windows the difference is even more pronounced; that, however, may be due to additional POSIX translation overhead or different caching (cygwin with coreutils paste 8.26: >20x slower; msys2 with coreutils paste 8.32: >12x slower than the python version; both running python 3.9.9). For these tests I just created a file with 100 very long lines and pasted it to itself, as it seems that the problematic thing is the handling of long lines.

jf1@s1 MSYS /d/temp/ui # time paste b b > /dev/null
real    0m5.920s
user    0m5.896s
sys     0m0.031s
jf1@s1 MSYS /d/temp/ui # time ../paste b b > /dev/null
real    0m0.480s
user    0m0.295s
sys     0m0.170s
jf1@s1 MSYS /d/temp/ui # ../paste b b > c1
jf1@s1 MSYS /d/temp/ui # paste b b > c2
jf1@s1 MSYS /d/temp/ui # diff c1 c2
jf1@s1 MSYS /d/temp/ui #
  • In my parallelized test the above is twice as fast as coreutils' paste on my 64 core system. Not pv performance, but clearly an improvement. Commented Jan 2, 2022 at 18:01
  • Did you exchange all instances of paste in this test or only the outermost? Commented Jan 6, 2022 at 15:29
  • I exchanged all. The outer one is currently using 100% CPU. Commented Jan 6, 2022 at 23:48
  • I can squeeze a few extra MB/s by putting more files in each inner paste. I can get around 100 MB/s. Commented Jan 6, 2022 at 23:54
