Pipe a lot of files to stdin, extract first columns, then combine those in a new file

Question

Suppose we have these two files:

$ cat ABC.txt
ABC DEF
$ cat PQR.txt
PQR XTZ

And we want to form a new file with the 1st column of each file. This can be achieved by:

$ paste -d ' ' <(cut -d ' ' -f 1 ABC.txt) <(cut -d ' ' -f 1 PQR.txt)
ABC PQR

But I want to use this with a large number of input files, not only ABC.txt and PQR.txt. How can we generalize this to pass each file in the collection to cut and then pass all the outputs to paste? (I know this may be done better with awk, but I want to know how to solve it using this approach.)

Edit 1

I have discovered a dirty, dirty way of doing this:

$ str=''; for i in *.txt; \
do str="${str} <(cut -d ' ' -f 1 ${i})"; \
done ; \
str="paste -d ' ' $str"; \
eval $str

But please, free my soul with an answer that does not involve going to Computer Science Hell.

Edit 2

Each file can have n rows, if this matters.

Do you have only one row for each file?

– karakfa, Commented Apr 21, 2016 at 18:28
No, each file has n rows.

– Dargor, Commented Apr 21, 2016 at 18:28

that other guy · Accepted Answer · 2016-04-22 21:18:17Z

Process substitution <(somecommand) doesn't pipe to stdin. It opens a pipe on a separate file descriptor, e.g. 63, and passes /dev/fd/63 to the command as an argument. When this "file" is opened, the kernel* duplicates the fd instead of opening a real file.
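A small sketch of that behaviour, assuming bash on a system that provides /dev/fd (the exact descriptor number varies and is not guaranteed to be 63):

```shell
#!/usr/bin/env bash
# Process substitution expands to a path, not to data on stdin.
path=$(echo <(true))
echo "$path"    # a path under /dev/fd on Linux; the number may differ

# Because it is a path, any command that accepts a file name can read it.
content=$(cat <(printf 'ABC DEF\n'))
echo "$content"
```

This is why paste can take several process substitutions at once: each one is just another file-name argument.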

We can do something similar by opening a bunch of file descriptors and then passing them to the command:

# Start subshell so all files are automatically closed
(
    fds=()
    n=0
    # Open a new fd for each process substitution
    for file in ./*.txt
    do
        exec {fds[n++]}< <(cut -d ' ' -f 1 "$file")
    done

    # fds now contain a list of fds like 12 14
    # prepend "/dev/fd/" to all of them
    parameters=( "${fds[@]/#//dev/fd/}" )

    paste -d ' ' "${parameters[@]}"
)

{var}< file is bash's syntax for dynamic file descriptor assignment. It works like var=4; exec 4< file; but without having to hardcode the 4: bash picks a free file descriptor and stores its number in var. exec opens it in the current shell.
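As a minimal illustration of the {var}< form on its own, assuming Bash 4.1 or later (the temporary file and the variable names fd, first and second are made up for this sketch):

```shell
#!/usr/bin/env bash
# Create a throwaway input file for the demonstration.
tmp=$(mktemp)
printf 'first line\nsecond line\n' > "$tmp"

# Let bash choose a free descriptor and store its number in $fd.
exec {fd}< "$tmp"

# Successive reads from the same descriptor continue where the last one stopped.
read -r first  <&"$fd"
read -r second <&"$fd"

# Close the descriptor explicitly, then clean up.
exec {fd}<&-
rm -f "$tmp"

echo "bash picked fd $fd; read: $first / $second"
```

Closing with exec {fd}<&- is what the subshell in the answer does implicitly when it exits.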

* Linux, FreeBSD, OpenBSD and XNU/OSX anyways. This is not POSIX, but neither is <(..)

Nicely done; it's worth mentioning that the {var} method of defining file descriptors requires Bash 4.1+.
Thanks for the great answer! By the way, I suppose you mean var=4; exec 4< file;, don't you?

agc · Accepted Answer · 2016-04-23 00:39:10Z

Given space-delimited input files, and provided ':' is a safe delimiter (i.e. there are no colons in the input), this paste-to-sed one-liner works:

paste -d':' *.txt | sed 's/ [^:]*$//;s/ [^:]*:*/ /g;s/://g'

(POSIX, with no eval, exec, bashisms, subshells, or loops.)
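To see what each stage contributes, here is a sketch that runs the one-liner on two small sample files in a temporary directory. The file contents extend the question's example with a second row, and every line is assumed to have at least two columns:

```shell
# Build two sample input files in a scratch directory.
dir=$(mktemp -d)
printf 'ABC DEF\nabc def\n' > "$dir/ABC.txt"
printf 'PQR XTZ\npqr xtz\n' > "$dir/PQR.txt"

# Stage 1: paste joins matching rows, with ':' marking the file boundaries.
joined=$(cd "$dir" && paste -d':' *.txt)

# Stage 2: sed drops everything after the first column of each file,
# then removes any leftover colons.
result=$(printf '%s\n' "$joined" | sed 's/ [^:]*$//;s/ [^:]*:*/ /g;s/://g')

printf '%s\n' "$joined" "$result"
rm -rf "$dir"
```

The intermediate rows are "ABC DEF:PQR XTZ" and "abc def:pqr xtz"; after sed they become "ABC PQR" and "abc pqr".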

edited Apr 23, 2016 at 0:39

answered Apr 22, 2016 at 16:58

6 Comments

webb Over a year ago

@that-other-guy's and my answers are ~50x as fast as this (tested with 3 10,000,000-line .txt files).

agc Over a year ago

@webb, that's cool, but didn't the OP say he was testing a lot of little files, rather than a few big files? A benchmark for 10,000,000 3-line text files might be more relevant.

webb Over a year ago

interesting point. your answer is 50x faster for 2000 single-line files, e.g., 40,000 files/second vs 800 files/second for @that-other-guy's and my answers! additionally, all three answers fail completely for e.g. 3000 (or more) files.

agc Over a year ago

@webb, that's a surprise, 50x on opposite sides... a good illustration of algorithms for file length vs. file size. If time permits, do provide more detail in what way all three answers "fail completely" at some point. Did they slow down, return wrong answers, or what?

webb Over a year ago

they fail because of too many open files (yours) or file descriptors (mine & @that-other-guy's).
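The limit in question is the per-process cap on open file descriptors, which the shell reports with ulimit -n. A rough sketch of a pre-flight check follows; fits_in_fd_limit is a hypothetical helper, not part of any answer here, and the headroom of 16 descriptors is an arbitrary assumption:

```shell
#!/usr/bin/env bash
# fits_in_fd_limit COUNT (hypothetical helper)
# Succeeds when COUNT inputs plus some headroom fit under the soft
# open-file limit of the current shell.
fits_in_fd_limit() {
    local count=$1 limit
    limit=$(ulimit -n)
    # Some systems report "unlimited" instead of a number.
    [ "$limit" = unlimited ] && return 0
    # 16 is an assumed allowance for stdin, stdout, stderr and the like.
    [ $((count + 16)) -le "$limit" ]
}

set -- ./*.txt
if fits_in_fd_limit "$#"; then
    echo "$# inputs should fit under the limit of $(ulimit -n)"
else
    echo "$# inputs may exceed the limit of $(ulimit -n)"
fi
```

When the check fails, the options are raising the limit or pasting the files in smaller batches.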

webb · Accepted Answer · 2016-05-03 18:16:15Z

After a closer look, I see that @that-other-guy's answer is awesome, but here also is another dirty dirty way that's roughly the same under the hood.

eval "paste -d' ' "$(find *.txt -printf " <(cut -d' ' -f1 '%f')")
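To make visible what eval actually runs, the sketch below builds the same string for two sample files and prints it before evaluating it. It assumes bash and GNU find (-printf is not available in every find), and the sample files are made up for the demonstration:

```shell
#!/usr/bin/env bash
# Create sample inputs in a scratch directory and work from there.
dir=$(mktemp -d)
printf 'ABC DEF\n' > "$dir/ABC.txt"
printf 'PQR XTZ\n' > "$dir/PQR.txt"
cd "$dir" || exit 1

# find emits one " <(cut ...)" fragment per file; the fragments are
# appended to the paste command to form a single command string.
generated="paste -d' ' "$(find *.txt -printf " <(cut -d' ' -f1 '%f')")
printf '%s\n' "$generated"

# eval parses the string again, so the process substitutions take effect.
result=$(eval "$generated")
printf '%s\n' "$result"

cd - >/dev/null && rm -rf "$dir"
```

Because each file name is wrapped in single quotes inside the generated string, a name that itself contains a single quote would break the command, which is one reason the eval approach counts as "dirty".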
