Print the 1st and every nth column of a text file using awk

Question

I have a tab-separated text file with 10177 columns and approximately 450,000 rows. I am trying to trim it down with awk so that it keeps only columns 1-3, column 5, and every 14th column after the fifth.

My file has a format that looks like:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ... 10177
A B C D E F G H I J K L M N O P Q R S T ...
X Y X Y X Y X Y X Y X Y X Y X Y X Y X Y ...

I am hoping to generate an output txt file (also separated with tab) that contains:

1 2 3 5 18 ...
A B C E R ...
X Y X X Y ...

The current awk code I have looks like (I am using cygwin to use the code):

$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt

But the result I am getting shows something like:

123518...ABCER...XYXXY...

When the file is opened in Excel, the results are all mashed into a single cell.

In addition, when I try to include code

for (i=0;i<=3;i++) printf "%s ",$i

in the awk to get the first 3 columns, it just prints out the original input document together with the mashed result. I am not familiar with awk, so I am not sure what causes this issue.

Ed Morton · Accepted Answer · 2016-12-08 02:51:17Z

Awk field numbers, strings, and array indices all start at 1, not 0, so when you do:

for (i=0;i<=3;i++) printf "%s ",$i

the first iteration prints $0 which is the whole record.
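A minimal sketch of the difference between $0 and $1, assuming a made-up one-line sample with three tab-separated fields (the sample data and variable names are only for illustration):

```shell
# Hypothetical sample record: three tab-separated fields.
sample=$(printf 'a\tb\tc')

# $0 is the whole record; $1 is the first field.
whole=$(printf '%s\n' "$sample" | awk -F'\t' '{ print $0 }')
first=$(printf '%s\n' "$sample" | awk -F'\t' '{ print $1 }')

printf 'whole record: %s\n' "$whole"
printf 'first field:  %s\n' "$first"
```

So a loop that starts at i=0 emits the entire line before it ever reaches the first field.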

You're on the right track with:

$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt

but never call printf with input data as its only argument. printf then treats the data as a format string with no arguments, rather than as data to be printed with a plain string format, and it will fail cryptically if/when your input contains formatting characters like %s or %d. So, always use printf "%s", $i, never printf $i. Note also that printf adds neither OFS nor a newline on its own, which is why your fields run together.
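A small sketch of the hazard, assuming a made-up field that contains a literal "%%" (chosen because every awk prints it the same way; other sequences such as %d behave differently between awk implementations):

```shell
# Hypothetical input: a single field that contains a literal "%%".
data='100%%'

# Data used as the format string: printf collapses "%%" into one "%".
as_format=$(printf '%s\n' "$data" | awk '{ printf $1 }')

# Data passed as an argument to an explicit "%s" format: printed unchanged.
as_arg=$(printf '%s\n' "$data" | awk '{ printf "%s", $1 }')

printf 'as format:   %s\n' "$as_format"
printf 'as argument: %s\n' "$as_arg"
```

The first form silently alters the data, while the second reproduces it exactly.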

The problem you're having with Excel, I would guess, is that you're double-clicking the file and hoping Excel knows what to do with it (it won't, unlike with a CSV). You can import a tab-separated file into Excel once Excel is open though - google that.

You want something like:

awk '
BEGIN { FS=OFS="\t" }
{
    for (i=1; i<=3; i++) {
        printf "%s%s", (i>1?OFS:""), $i
    }
    for (i=5; i<=NF; i+=14) {
        printf "%s%s", OFS, $i
    }
    print ""
}
' file
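To check the script on something smaller than the real file, here is a sketch that runs the same two loops against a generated row. The 20-column test row and the variable names are assumptions made for the demo:

```shell
# Hypothetical test input: one row with columns numbered 1..20, tab-separated.
row=$(awk 'BEGIN { for (i = 1; i <= 20; i++) printf "%s%s", i, (i < 20 ? "\t" : "\n") }')

selected=$(printf '%s\n' "$row" | awk '
BEGIN { FS = OFS = "\t" }
{
    # Columns 1-3, with a separator before every field except the first.
    for (i = 1; i <= 3; i++)      printf "%s%s", (i > 1 ? OFS : ""), $i
    # Column 5 and every 14th column after it.
    for (i = 5; i <= NF; i += 14) printf "%s%s", OFS, $i
    print ""
}')

printf '%s\n' "$selected"
```

With 20 columns, the fields that come out are 1, 2, 3, 5 and 19, separated by tabs.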

I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

James Brown · Accepted Answer · 2016-12-08 06:37:20Z

In awk, using the conditional operator in the increment of the for loop:

$ awk 'BEGIN { FS=OFS="\t" } { for(i=1; i<=NF; i+=( i<3 ? 1 : ( i==3 ? 2 : 14 ))) printf "%s%s", $i, ( (i+14)>NF ? ORS : OFS) }' file
1 2 3 5 19
A B C E S
X Y X X X

In the for loop, i is incremented by one while i<3, by two when i==3 (to get to 5), and by 14 after that.
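To see which indices that increment expression visits, here is a sketch that prints them for an assumed record width of 40 fields (the variable nf stands in for NF and is only for illustration):

```shell
# Print the indices the loop visits for a hypothetical record of 40 fields.
visited=$(awk 'BEGIN {
    nf = 40   # stand-in for NF; assumed value for the demo
    for (i = 1; i <= nf; i += (i < 3 ? 1 : (i == 3 ? 2 : 14)))
        printf "%s%s", i, ((i + 14) > nf ? "\n" : " ")
}')
printf '%s\n' "$visited"
```

Note that the (i+14)>NF test decides when to end the line, so it relies on the record being wide enough that the loop gets past the first few columns before that test becomes true.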

James K. Lowden · Accepted Answer · 2016-12-08 04:50:59Z

I would be tempted to solve the problem along the following lines. I think you'll find you save time by not iterating in awk.

$ cols="$( { printf '%s\n' 1 2 3; seq 5 14 10177; } | sed 's/^/$/' | paste -sd, - )"
$ awk -F'\t' -v OFS='\t' "{print $cols}" test.txt
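It helps to inspect the generated field list before handing it to awk. The sketch below builds the list with printf, seq, sed and paste (a swapped-in way of joining the pieces with commas) for an assumed maximum of 40 columns instead of 10177, then applies it to a generated test row:

```shell
# Assumed smaller maximum column for the demo (the real file has 10177).
max=40

# One field number per line, prefix each with "$", then join with commas.
cols=$( { printf '%s\n' 1 2 3; seq 5 14 "$max"; } | sed 's/^/$/' | paste -sd, - )
printf 'field list: %s\n' "$cols"

# Hypothetical test row: columns numbered 1..40, tab-separated.
row=$(awk -v n="$max" 'BEGIN { for (i = 1; i <= n; i++) printf "%s%s", i, (i < n ? "\t" : "\n") }')

# The shell expands $cols once; awk then sees an ordinary print statement.
picked=$(printf '%s\n' "$row" | awk -F'\t' -v OFS='\t' "{ print $cols }")
printf '%s\n' "$picked"
```

Because the column list is fixed text in the awk program, awk does no per-record loop arithmetic, which is the time saving this approach is after.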
