AWK: Print lines matching a pattern

Question

I have a tab-separated file in which the last fifteen fields consist of zeros and ones. I need to print only the lines that do not contain more than five consecutive zeros or more than five consecutive ones within those fifteen fields, taken as groups of five fields.

File:

abadenguísimo abadenguísimo adjective n/a n/a singular n/a masculine 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
abalaustradísimo abalaustradísimo adjective n/a n/a singular n/a masculine 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
abiertísimas abiertísimo adjective n/a n/a plural n/a feminine 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
abellacadísimo abellacadísimo adjective n/a n/a singular n/a masculine 1 0 1 1 1 0 0 1 0 0 1 0 0 0 0
cansonísimos cansonísimo adjective n/a n/a plural n/a masculine 0 1 1 1 0 0 0 0 1 0 0 0 0 0 1

Output:

abellacadísimo abellacadísimo adjective n/a n/a singular n/a masculine 1 0 1 1 1 0 0 1 0 0 1 0 0 0 0
cansonísimos cansonísimo adjective n/a n/a plural n/a masculine 0 1 1 1 0 0 0 0 1 0 0 0 0 0 1

I tried this:

BEGIN { FS = "\t" }
{
    a=0; b=0; c=0;
    num[A]=""; num[B]=""; num[C]="";
    for (i = 9; i <= 13; i++)
        num[A]=num[A]""$i;
    for (j = 14; j <= 18; j++)
        num[B]=num[B]""$j;
    for (k = 19; k <= 23; k++)
        num[C]=num[C]""$k;
    if ((num[A] != "00000") && (num[A] != "11111")) { a=1; }
    if (num[B] != "00000") { b=1; }
    if (num[C] != "00000") { c=1; }
    if ((a == 1) || (b == 1) || (c == 1)) { print; }
}

Finally, I think I've found a solution, though I don't know why the other code doesn't work for me.

BEGIN {
    FS = "\t"
    cont=0;
}
{
    a=0; b=0; c=0;
    sum1=$9+$10+$11+$12+$13;
    sum2=$14+$15+$16+$17+$18;
    sum3=$19+$20+$21+$22+$23;
    if (( sum1 > 0 ) && ( sum1 < 5 )) { a=1; }
    if ( sum2 > 0 ) { b=1; }
    if ( sum3 > 0 ) { c=1; }
    if ((a == 1) || (b == 1) || (c == 1)) {
        cont++;
        print;
    }
}
END {
    print "Total: "NR;
    print "OK: "cont;
}
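
One likely reason the first script prints every line: `A`, `B` and `C` are unset variables, not strings, so `num[A]`, `num[B]` and `num[C]` all refer to the same element `num[""]`. That element ends up holding all fifteen digits and never equals "00000" or "11111". A minimal sketch of the effect follows; the array name `fix` and the quoted keys are only one possible correction, not something taken from the thread.

```shell
out=$(awk 'BEGIN {
    # Unset variables used as subscripts all evaluate to the empty string.
    num[A] = "11111"; num[B] = "00000"
    n = 0; for (k in num) n++
    print n, num[A]          # one element; the second assignment overwrote the first

    # Quoted subscripts are distinct keys (hypothetical fix).
    fix["A"] = "11111"; fix["B"] = "00000"
    n = 0; for (k in fix) n++
    print n, fix["A"]
}')
printf '%s\n' "$out"
```

With quoted keys (or three separate scalar variables) the string comparisons in the first script would see each group on its own.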

If I'm understanding your code correctly, the three groups of five are important? So you would not want to print a line with 00000 11111 00000, but you WOULD want to print a line with 00011 11100 01010. Is that correct?
– ghoti, Commented Nov 2, 2015 at 12:57

Community · Accepted Answer · 2020-06-20 09:12:55Z


If you translate your requirement from English into a regex and give it to grep, it will do what you want:

grep -vE '(1\s+){6,}|(0\s+){6,}' file

You can adjust the \s+, for example change it to \t or something else for your needs.
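
As one such adjustment, `\s` is a GNU extension in `grep -E`, and the POSIX class `[[:space:]]` should behave the same way here. The sketch below assumes shortened rows (a made-up label in field 1 followed by the fifteen flags); the `sample` function and the labels are illustrative only.

```shell
# Two shortened, tab-separated rows: a hypothetical label plus the 15 flags.
sample() {
    printf 'uniform\t1\t1\t1\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\n'
    printf 'mixed\t1\t0\t1\t1\t1\t0\t0\t1\t0\t0\t1\t0\t0\t0\t0\n'
}

# Drop every line with six or more consecutive "digit + whitespace" repeats,
# then keep only the label so the result is easy to read.
out=$(sample | grep -vE '(1[[:space:]]+){6,}|(0[[:space:]]+){6,}' | cut -f1)
printf '%s\n' "$out"
```

Note that this pattern looks at runs across the whole line rather than per group of five, and a run that reaches the end of the line is counted one short because the last field has no trailing whitespace.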

Update

awk -F'\t' '{s=NF-15+1
    c=i=0
    while(++c<=3){
        x=i?i:s
        t=0
        for(i=x;i<x+5;i++)
            t+=$i+0
        if(t==0||t==5)
            next
    }
    print
}' file

This gives you the expected output. It checks for "more than FOUR consecutive zeros/ones" instead of FIVE: each group has at most 5 elements/columns, so ">5" can never happen.
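
The awk update skips a line as soon as any one group sums to 0 or 5, whereas the asker's own script prints a line when any group passes its test. The two readings only differ on lines that mix uniform and non-uniform groups. A small sketch to compare them, assuming shortened rows with a made-up label in field 1 and the flags in the last fifteen fields; the `rows` function, the labels and the two output columns are illustrative.

```shell
rows() {
    printf 'all_uniform\t1\t1\t1\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\n'
    printf 'one_uniform\t0\t0\t0\t0\t0\t1\t1\t0\t0\t1\t0\t1\t0\t1\t0\n'
    printf 'none_uniform\t1\t0\t1\t1\t1\t0\t0\t1\t0\t0\t1\t0\t0\t0\t0\n'
}

out=$(rows | awk -F'\t' '{
    uniform = 0                          # groups that are all zeros or all ones
    for (g = 0; g < 3; g++) {
        t = 0
        first = NF - 14 + 5 * g          # first field of group g
        for (i = first; i < first + 5; i++) t += $i
        if (t == 0 || t == 5) uniform++
    }
    # column 2: skip if ANY group is uniform; column 3: print if ANY group is mixed
    print $1, (uniform == 0 ? "print" : "skip"), (uniform < 3 ? "print" : "skip")
}')
printf '%s\n' "$out"
```

Both readings agree on the sample data in the question; only a line like `one_uniform` tells them apart.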

edited Jun 20, 2020 at 9:12

CommunityBot


answered Nov 2, 2015 at 12:34

Kent


7 Comments

Polucho Over a year ago

If you divide the last fifteen fields into three, each of these groups can't have more than five ones or more than five zeros. Thank you.

Kent Over a year ago

@Criatos oh you just want to check each "group" in the last 15 fields? E.g. in this way (5 cols)(5 cols)(5 cols)?

Polucho Over a year ago

That is what I want.

ghoti Over a year ago

That's why including code in the question is important. :-)

Kent Over a year ago

@Criatos if you divide 15 cols into 3 groups, how can one group have more than five consecutive zeros or more than five consecutive ones? There are only 5 fields in each group.


bian · Accepted Answer · 2015-11-02 14:09:36Z

awk 4

awk 'split($0,t,/(1 +){6,}|(0 +){6,}/)<2' file

awk 3.1

awk --posix 'split($0,t,/(1 +){6,}|(0 +){6,}/)<2' file

update

awk '{for(i=9;i<=NF;i++){a[$i];if(++c==5){l=length(a);delete a;c=0;if(l>1){print;break}}}}' file
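
The update relies on array keys acting as a set: after five fields, a group that holds both 0 and 1 leaves two keys in `a`. A more verbose sketch of the same idea follows, counting keys with a loop instead of calling `length()` on an array and clearing the set with `split("", seen)`, since not every awk supports the shorter forms. It assumes the flags are the last fifteen tab-separated fields; `rows`, `seen` and the labels are made-up names.

```shell
rows() {
    printf 'uniform\t1\t1\t1\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\n'
    printf 'mixed\t0\t1\t1\t1\t0\t0\t0\t0\t1\t0\t0\t0\t0\t0\t1\n'
}

out=$(rows | awk -F'\t' '{
    for (g = 0; g < 3; g++) {
        split("", seen)                  # empty the set before each group
        first = NF - 14 + 5 * g          # first field of group g
        for (i = first; i < first + 5; i++) seen[$i]
        n = 0
        for (k in seen) n++              # number of distinct values in this group
        if (n > 1) { print $1; next }    # a mixed group: print once and move on
    }
}')
printf '%s\n' "$out"
```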

If you divide the last fifteen fields into three, each of these groups can't have more than five ones or more than five zeros. Thank you.

ghoti · Accepted Answer · 2015-11-02 19:12:45Z

The following ERE in grep works with your input data, where ALL THREE groups of five have matching content:

egrep -v '(\s+[01])\1\1\1\1(\s+[01])\2\2\2\2(\s+[01])\3\3\3\3' file

Since your question is tagged awk, though, let's express this in awk.

We can't do the same thing in awk, because awk traditionally does not support backreferences in regular expressions. So as your script suggests, doing this programmatically may be the answer. Your solution concatenates fields and compares strings. I think I would probably use arithmetic instead -- a sum of the five fields is a number from zero to five. A value of zero or five means "skip", anything else means "print".

#!/usr/bin/awk -f

{
  # Count back from the end in groups of five, until we hit a field
  # that is neither "0" nor "1"...
  group=0;
  start=NF;
  while (start >= 5 && $start ~ /^[01]$/) {
    group++;
    sum[group]=0;   # reset, so sums do not carry over from the previous line
    for (i=start; i>start-5; i--) {
      sum[group]+=$i;
    }
    start=i;
  }

  # Step through every group, adding a condition to a counter.
  # At the end of the loop, if found > 0, then we've found a line
  # that does not have the pattern specified.
  found=0;
  for (; group>0; group--) {
    # A sum of 0 or 5 means a uniform group; anything else counts.
    found+=(sum[group] > 0 && sum[group] < 5);
  }
}

# If found > 0, print the line.
found
