Transforming text to tabular form

Question

I have a text file with the following structure:

aaa
bbb
ccc
ddd
eee
fff

1
2
3
4
5
6

1.1
1.2
1.3
1.4
1.5
1.6

ggg
hhh
iii
jjj
kkk
lll

7
8
9
10
11
12

2.1
2.2
2.3
2.4
2.5
2.6

and I want the following tabular structure:

aaa 1 1.1
bbb 2 1.2
ccc 3 1.3
ddd 4 1.4
eee 5 1.5
fff 6 1.6
ggg 7 2.1
hhh 8 2.2
iii 9 2.3
jjj 10 2.4
kkk 11 2.5
lll 12 2.6

In this example each column repeats the pattern 2 times but the actual file does it more times and has more fields.

so each block has 6 lines and you need three blocks aligned at a time? — iruvar
– iruvar, Commented Aug 30, 2013 at 17:15
I have a set of blocks of 6 lines, all aligned (the first file), and I want a table with all the blocks of the same type in one column. For example, the blocks of the type "xxx", where x is a letter, will go to the first column of the table. — Msegade
– Msegade, Commented Aug 30, 2013 at 17:21
so your types are 1. xxx where x is a letter 2. integers and 3. a.b where a and b are integers? — iruvar
– iruvar, Commented Aug 30, 2013 at 17:24
In this example yes, but in the real file there are other types that can be of the same form, for example integers. — Msegade
– Msegade, Commented Aug 30, 2013 at 17:30

iruvar · Accepted Answer · 2013-08-30 18:02:49Z

paste should be able to do the job. Here x.1 is the name of the input file.

paste <(grep -E '^[[:alpha:]]+$' x.1) \
      <(grep -E '^[[:digit:]]+$' x.1) \
      <(grep -E '^[[:digit:]]+[.][[:digit:]]+$' x.1)
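The command above relies on process substitution, `<(...)`, which bash, zsh and ksh provide but plain POSIX sh does not. As a minimal sketch for such shells, the same three patterns can write to temporary files first. The function name `split_by_type` and the temporary file names are made up for this illustration, and `mktemp -d` is assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: split_by_type FILE
# Uses the same three grep patterns as the answer, but stores each
# column in a temporary file instead of using process substitution.
split_by_type() {
    input=$1
    tmpdir=$(mktemp -d) || return 1

    grep -E '^[[:alpha:]]+$' "$input" > "$tmpdir/alpha"                  # letters only
    grep -E '^[[:digit:]]+$' "$input" > "$tmpdir/int"                    # integers
    grep -E '^[[:digit:]]+[.][[:digit:]]+$' "$input" > "$tmpdir/float"   # digits.digits

    # paste joins line k of every file with a tab
    paste "$tmpdir/alpha" "$tmpdir/int" "$tmpdir/float"
    rm -rf "$tmpdir"
}
```

Calling `split_by_type x.1` prints tab-separated rows. Like the original command, it can only tell columns apart when every column has its own pattern.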

A thing of beauty. I knew there was a paste way to do this, thanks for showing us how!!! — slm
– slm ♦, Commented Aug 30, 2013 at 18:45

Stéphane Chazelas · Accepted Answer · 2013-08-30 19:45:42Z

You could do:

mkfifo 0 1 2
awk -v RS= '{print > NR%3}' < file &
paste 1 2 0

There's potential for deadlock if any of the paragraphs are larger than the pipe buffer (64k on Linux).
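If the deadlock risk is a concern, one alternative (not the method of this answer) is to let awk write to ordinary temporary files and run paste only after awk has finished. The sketch below assumes blank-line-separated blocks as in the question. The function name `blocks_to_table`, its optional column-count argument and the temporary directory layout are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: blocks_to_table FILE [NCOLS]
# Distributes paragraphs round-robin over NCOLS regular files
# (default 3), then pastes the files side by side.
blocks_to_table() {
    input=$1
    ncols=${2:-3}
    tmpdir=$(mktemp -d) || return 1

    # Build the list of column files in order and create them empty,
    # so paste still works if the input has fewer than NCOLS blocks.
    set --
    i=1
    while [ "$i" -le "$ncols" ]; do
        : > "$tmpdir/$i"
        set -- "$@" "$tmpdir/$i"
        i=$((i + 1))
    done

    # RS= puts awk in paragraph mode; paragraph k goes to file ((k-1) mod n) + 1.
    awk -v RS= -v n="$ncols" -v dir="$tmpdir" '
        { out = dir "/" ((NR - 1) % n + 1); print > out }
    ' "$input"

    paste "$@"
    rm -rf "$tmpdir"
}
```

`blocks_to_table file` handles the three-column case of the question and `blocks_to_table file 5` a five-column one. The trade-off is that the whole input is written to disk once before any output appears.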

@slm, the main magic is in RS=. From the gawk documentation: "By a special dispensation, an empty string as the value of RS indicates that records are separated by one or more blank lines".
– iruvar, Commented Aug 30, 2013 at 21:53

MIND BLOWN! Starting to sink in: the print > NR%3 is the key piece, where awk is splitting up the records into one of the 3 fifos (1, 2, or 0). Paste is then tasked with assembling the contents of these 3 fifos.
– slm ♦, Commented Aug 30, 2013 at 21:56
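To see the routing described in these comments without creating any FIFOs, a small sketch can print the value of NR % 3 for each paragraph. The function name `show_routing` and the message format are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads blank-line-separated blocks on stdin and
# reports which FIFO name (NR % 3) each paragraph would be written to.
# In paragraph mode a newline also separates fields, so $1 is the
# first line of the block when lines contain no spaces.
show_routing() {
    awk -v RS= '{ print "paragraph " NR " (first line " $1 ") -> fifo " (NR % 3) }'
}

printf 'aaa\nbbb\n\n1\n2\n\n1.1\n1.2\n\nggg\nhhh\n' | show_routing
```

The first three paragraphs map to 1, 2 and 0, and the fourth wraps around to 1 again, which is why the answer passes the FIFOs to paste in the order 1 2 0.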


Barun · Accepted Answer · 2013-08-30 19:36:39Z

Considering four types of data -- 1) alphabets, 2) integers, 3) floating point numbers and 4) alphanumerics, the following awk script does the job.

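# Lines made of letters only; n counts the rows of the table
/^[a-zA-Z]+$/ { alphabets[ia++] = $1; n++; }

# Lines that mix letters and digits
/[a-zA-Z]+[0-9]+[a-zA-Z0-9]*/ || /[0-9]+[a-zA-Z]+[a-zA-Z0-9]*/ { alphanumerics[an++] = $1; }

# Lines containing digits, a dot, then digits
/[0-9]+[.][0-9]+/ { floats[f++] = $1; }

# Lines made of digits only
/^[0-9]+$/ { integers[k++] = $1; }

# Print one row per alphabetic entry
END {
    for (i = 0; i < n; i++) {
        print alphabets[i], integers[i], floats[i], alphanumerics[i];
    }
}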

Save the above code in a file say, table.awk, and execute as

awk -f table.awk input_text_file

In particular, the blocks of the above mentioned "data types" can appear in any order in the input file. The output obtained with the sample data and six alphanumeric values is as follows:

aaa 1 1.1 a1
bbb 2 1.2 b2
ccc 3 1.3 c3
ddd 4 1.4 d4
eee 5 1.5 e55
fff 6 1.6 6fF
ggg 7 2.1
hhh 8 2.2
iii 9 2.3
jjj 10 2.4
kkk 11 2.5
lll 12 2.6

@1_CR Ah, you can skip it! Just copied from some previous code :P Edited. — Barun
– Barun, Commented Aug 30, 2013 at 18:25
It should work for skipping empty lines, which is never a bad idea. — terdon
– terdon ♦, Commented Aug 30, 2013 at 18:36
@terdon Exactly! I usually begin that way, but eventually found it was not required in this case. — Barun
– Barun, Commented Aug 30, 2013 at 18:41
@Barun, !NF is the more idiomatic way of skipping blank lines with awk — iruvar
– iruvar, Commented Aug 30, 2013 at 20:35
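For reference, a minimal sketch of the `!NF` idiom from the last comment. NF is the number of fields on the current line, so it is 0 for empty and whitespace-only lines. The function name `skip_blank` is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: skip_blank [FILE...]
# Drops lines that have no fields (empty or whitespace-only) and
# prints everything else unchanged.
skip_blank() {
    awk '!NF { next } { print }' "$@"
}
```

In a larger script, the `!NF { next }` rule would go first so that later rules never see blank lines.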

Kaz · Accepted Answer · 2015-12-10 00:18:14Z

Using a single TXR Lisp expression, based on a pipeline of higher order functions and partial application, and a quasiliteral string for formatting the fixed-width fields:

$ txr -e '[(opip (partition* @1 (op where (op equal "")))
               (tuples 3)
               (reduce-left (op mapcar append))
               (apply mapdo (op pprinl `@{1 6} @{2 6} @{3 6}`)))
         (get-lines)]' < data
aaa 1 1.1
bbb 2 1.2
ccc 3 1.3
ddd 4 1.4
eee 5 1.5
fff 6 1.6
ggg 7 2.1
hhh 8 2.2
iii 9 2.3
jjj 10 2.4
kkk 11 2.5
lll 12 2.6

How it works

Overall the whole expression has the form [function argument]. The argument is (get-lines), which snarfs lines from a stream and returns a (lazy) list of strings. The stream defaults to *stdin*. The function is constructed by the (opip ...) macro, and that's where all the action happens.

To understand opip (which stands for "op pipeline"), we first have to know op, which opip uses implicitly. Also, op is used explicitly in a few places. In a nutshell, (op function args ...) is syntactic sugar for creating an anonymous function which calls function and supplies some of its arguments. Within args ..., the anonymous function's arguments can be referenced by number. The anonymous function also implicitly takes trailing arguments. For instance, (op + 3) denotes an anonymous function which adds its arguments together, and adds 3. (op - @1 3) is an anonymous function which subtracts 3 from its argument. The syntax @1 denotes the insertion of the function's first argument into the given position in the expression.

(op mapcar append) is a function to which we can pass a bunch of lists, each of which contains lists. The function will take these lists tuple-wise and append them together. This is the basis for the paste-like logic for joining the data.

The opip macro takes a bunch of expressions and essentially inserts op into them, and then creates a function which pipes data through the resulting anonymous functions. That's a simplification, but it will do.

(partition* @1 (op where (op equal ""))) breaks up the list of raw lines from the file into partitions based on cutting the list where its elements are blank lines (equal to the empty string), and removing those entries. (The partition function without the * in its name will leave those blanks in place).

(tuples 3) gathers up these partitions into groups of 3.

These groups of 3, or triplets, have to be accumulated together in parallel: the first elements of the triplets have to be appended into a single list, the second elements into a single list and so on. That is the job of (reduce-left (op mapcar append)). The kernel function (op mapcar append) is given a pair of triplets, and catenates their corresponding entries together to create a merged triplet. The reduce-left function decimates the list of triplets through this down to a single triplet.

This master triplet is then applied as arguments to a mapdo call, in the final expression (apply mapdo (op pprinl ...)). mapdo receives a function as its leftmost argument, generated by (op ...) once again. The remaining arguments are the three elements of the giant triplet, representing the three columns of data. The columns are mapped row by row through the anonymous function.

The anonymous function takes three arguments, which are referenced in the quasiliteral string `@{1 6} @{2 6} @{3 6}`, where @{1 6} means @1, set in a field width of 6. This quasiliteral interpolates the three arguments (three elements pulled in parallel by mapdo from the triplet of columns) and constructs a string, which the anonymous function passes to pprinl, which prints it with a newline.
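For readers who do not have TXR installed, the same four steps (split on blank lines, group the blocks in threes, merge each column, print fixed-width fields) can be sketched in awk. This is an approximation under stated assumptions, not a translation of the expression above: the function name `blocks_to_fixed_width` and its optional arguments are hypothetical, fields are left-justified, and the last column is printed without padding.

```shell
#!/bin/sh
# Hypothetical helper: blocks_to_fixed_width FILE [NCOLS] [WIDTH]
# Defaults: 3 columns, field width 6.
blocks_to_fixed_width() {
    awk -v RS= -v FS='\n' -v n="${2:-3}" -v w="${3:-6}" '
    {
        # Paragraph mode: each record is one block, each field one line.
        col = (NR - 1) % n + 1            # column this block belongs to
        for (i = 1; i <= NF; i++)         # append its lines to that column
            cell[col, ++rows[col]] = $i
    }
    END {
        fmt = "%-" w "s "                 # left-justified field plus a space
        for (r = 1; r <= rows[1]; r++) {
            for (c = 1; c < n; c++)
                printf fmt, cell[c, r]
            print cell[n, r]              # last column, unpadded
        }
    }' "$1"
}
```

Appending every block directly to its column plays the role of both (tuples 3) and the reduce-left merge, since block k always lands in column ((k - 1) mod 3) + 1.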
