0

I have a tab-separated text file with 10177 columns and approximately 450,000 rows. I am trying to trim it down with awk so that it keeps only columns 1-3, column 5, and every 14th column after the fifth.

My file has a format that looks like:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ... 10177
A B C D E F G H I J K L M N O P Q R S T ...
X Y X Y X Y X Y X Y X Y X Y X Y X Y X Y ...

I am hoping to generate an output text file (also tab-separated) that contains:

1 2 3 5 18 ...
A B C E R ...
X Y X X Y ...

My current awk command (run under Cygwin) is:

$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt 

But the result I am getting shows something like:

123518...ABCER...XYXXY... 

When I open it in Excel, everything is mashed into a single cell.

In addition, when I try to include the code

for (i=0;i<=3;i++) printf "%s ",$i 

in the awk to get the first 3 columns, it just prints out the original input document together with the mashed result. I am not familiar with awk, so I am not sure what causes this issue.

3 Answers

2

Awk field numbers, strings, and array indices all start at 1, not 0, so when you do:

for (i=0;i<=3;i++) printf "%s ",$i 

the first iteration prints $0 which is the whole record.
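A quick way to see the difference, sketched on a made-up four-field line (the sample data and variable names are assumptions for illustration only):

```shell
# Hypothetical tab-separated sample line, not taken from the question's file.
line=$(printf 'A\tB\tC\tD')

# Loop starting at 0: the first item printed is $0, the whole record.
from_zero=$(printf '%s\n' "$line" |
    awk -F'\t' '{ for (i=0; i<=1; i++) printf "[%s]", $i; print "" }')

# Loop starting at 1: only individual fields are printed.
from_one=$(printf '%s\n' "$line" |
    awk -F'\t' '{ for (i=1; i<=2; i++) printf "[%s]", $i; print "" }')

echo "$from_zero"   # whole line in the first bracket, then [A]
echo "$from_one"    # [A][B]
```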

You're on the right track with:

$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt 

but never call printf with input data as its only argument. printf then treats your data as a format string with no arguments, rather than as a value for a plain string format, and it will fail cryptically if your input ever contains formatting characters like %s or %d. So always use printf "%s", $i, never printf $i. Also, printf adds no separator or newline by itself, so you have to print OFS explicitly; that is why your fields ran together.

The problem you're having with Excel, I would guess, is that you're double-clicking the file and hoping Excel knows what to do with it (it won't, unlike with a CSV). You can import tab-separated files into Excel once it is open, though; search for how to do that.
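A minimal sketch of the safe form, using a made-up one-field input that contains a percent sign (the sample value is an assumption, not from the question's file):

```shell
# Hypothetical input line whose only field looks like a format specifier.
input='100%d'

# Safe: the data is passed as an argument to a fixed "%s" format.
safe=$(printf '%s\n' "$input" | awk '{ printf "%s\n", $1 }')
echo "$safe"

# Unsafe (left commented out): here the data itself becomes the format string,
# and awk implementations differ in how they complain or what they print.
# printf '%s\n' "$input" | awk '{ printf $1 }'
```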

You want something like:

awk '
BEGIN { FS=OFS="\t" }
{
    for (i=1; i<=3; i++) {
        printf "%s%s", (i>1?OFS:""), $i
    }
    for (i=5; i<=NF; i+=14) {
        printf "%s%s", OFS, $i
    }
    print ""
}
' file
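To check the column selection without the full 10177-column file, the same two loops can be run on a generated 20-column row. This is only a sketch; the row contents and the variable names `row` and `out` are made up for the test:

```shell
# Build one hypothetical tab-separated row holding the numbers 1..20.
row=$(awk 'BEGIN { for (i=1; i<=20; i++) printf "%s%s", i, (i<20 ? "\t" : "\n") }')

# Same logic as the script above: fields 1-3, then field 5 and every 14th after it.
out=$(printf '%s\n' "$row" | awk '
BEGIN { FS=OFS="\t" }
{
    for (i=1; i<=3; i++)     printf "%s%s", (i>1 ? OFS : ""), $i
    for (i=5; i<=NF; i+=14)  printf "%s%s", OFS, $i
    print ""
}')

echo "$out"   # fields 1, 2, 3, 5, 19 separated by tabs
```

Note that stepping by 14 from column 5 lands on column 19, whereas the sample output in the question shows 18. If column 18 is what is actually wanted, the start or step of the second loop would need adjusting.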

I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

1

In awk, using the conditional operator in the for loop's increment:

$ awk 'BEGIN { FS=OFS="\t" } { for(i=1; i<=NF; i+=( i<3 ? 1 : ( i==3 ? 2 : 14 ))) printf "%s%s", $i, ( (i+14)>NF ? ORS : OFS) }' file
1 2 3 5 19
A B C E S
X Y X X X

In the for loop, i is incremented by one while i<3, by two when i==3 (to get to 5), and by 14 after that.
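The sequence of field numbers that this increment visits can be printed on its own. A small sketch, using 40 as an arbitrary stand-in for NF:

```shell
# Print the indices visited by the conditional increment, up to a made-up limit of 40.
idx=$(awk 'BEGIN {
    for (i=1; i<=40; i+=(i<3 ? 1 : (i==3 ? 2 : 14)))
        printf "%s%s", (n++ ? " " : ""), i
    print ""
}')

echo "$idx"   # 1 2 3 5 19 33
```

The ORS test in the one-liner, (i+14)>NF, relies on each record having at least 17 fields; with fewer, the newline would be emitted before the last selected field. That holds for the 10177-column file in the question.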

0

I would be tempted to solve the problem along the following lines. I think you'll find you save time by not iterating in awk.

$ cols="$( { seq 1 3; seq 5 14 10177; } | sed 's/^/$/' | paste -sd, - )"
$ awk -F'\t' -v OFS='\t' "{print $cols}" test.txt
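To see what a generated field list looks like, here is a sketch with a smaller upper bound (40 instead of 10177, chosen only for illustration). It joins the lines with paste, which is one way among several to get a single comma-separated line that awk accepts as print arguments; the 40-column test row is made up:

```shell
# Build the field list: $1..$3, then $5 and every 14th field up to 40.
cols=$( { seq 1 3; seq 5 14 40; } | sed 's/^/$/' | paste -sd, - )
echo "$cols"   # $1,$2,$3,$5,$19,$33

# Hypothetical 40-column tab-separated row holding the numbers 1..40.
row=$(awk 'BEGIN { for (i=1; i<=40; i++) printf "%s%s", i, (i<40 ? "\t" : "\n") }')

# The shell expands $cols into the awk program text before awk runs.
picked=$(printf '%s\n' "$row" | awk -F'\t' -v OFS='\t' "{print $cols}")
echo "$picked"
```

Setting OFS here keeps the output tab-separated; without it, print would join the fields with single spaces.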
