0

I have a tab-separated file whose last fifteen fields consist of zeros and ones. What I need to do is print the lines that do not contain more than five consecutive zeros or more than five consecutive ones among those fifteen fields, which are split into groups of five fields.

File:

abadenguísimo abadenguísimo adjective n/a n/a singular n/a masculine 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
abalaustradísimo abalaustradísimo adjective n/a n/a singular n/a masculine 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
abiertísimas abiertísimo adjective n/a n/a plural n/a feminine 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
abellacadísimo abellacadísimo adjective n/a n/a singular n/a masculine 1 0 1 1 1 0 0 1 0 0 1 0 0 0 0
cansonísimos cansonísimo adjective n/a n/a plural n/a masculine 0 1 1 1 0 0 0 0 1 0 0 0 0 0 1

Output:

abellacadísimo abellacadísimo adjective n/a n/a singular n/a masculine 1 0 1 1 1 0 0 1 0 0 1 0 0 0 0
cansonísimos cansonísimo adjective n/a n/a plural n/a masculine 0 1 1 1 0 0 0 0 1 0 0 0 0 0 1

I tried this:

BEGIN { FS = "\t" }
{
    a=0; b=0; c=0;
    num[A]=""; num[B]=""; num[C]="";
    for (i = 9; i <= 13; i++)
        num[A]=num[A]""$i;
    for (j = 14; j <= 18; j++)
        num[B]=num[B]""$j;
    for (k = 19; k <= 23; k++)
        num[C]=num[C]""$k;
    if ((num[A] != "00000") && (num[A] != "11111")) { a=1; }
    if (num[B] != "00000") { b=1; }
    if (num[C] != "00000") { c=1; }
    if ((a == 1) || (b == 1) || (c == 1)) { print; }
}

I think I've finally found a solution, but I don't know why the code above doesn't work for me.

BEGIN {
    FS = "\t"
    cont=0;
}
{
    a=0; b=0; c=0;
    sum1=$9+$10+$11+$12+$13;
    sum2=$14+$15+$16+$17+$18;
    sum3=$19+$20+$21+$22+$23;
    if (( sum1 > 0 ) && ( sum1 < 5 )) { a=1; }
    if ( sum2 > 0 ) { b=1; }
    if ( sum3 > 0 ) { c=1; }
    if ((a == 1) || (b == 1) || (c == 1)) { cont++; print; }
}
END {
    print "Total: "NR;
    print "OK: "cont;
}
  • If I'm understanding your code correctly, the three groups of five are important? So you would not want to print a line with 00000 11111 00000, but you WOULD want to print a line with 00011 11100 01010. Is that correct? Commented Nov 2, 2015 at 12:57
  • @ghoti yes, that is correct. Commented Nov 2, 2015 at 14:43

3 Answers

1

If you translate your requirement from English into a regex and give it to grep, it will do what you want:

grep -vE '(1\s+){6,}|(0\s+){6,}' file 

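You can adjust the \s+ (a GNU extension) to suit your data, for example replacing it with [[:space:]]+ or a literal tab character. Note that plain grep -E does not interpret \t as a tab.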

Update

awk -F'\t' '{
    s = NF - 15 + 1
    c = i = 0
    while (++c <= 3) {
        x = i ? i : s
        t = 0
        for (i = x; i < x + 5; i++)
            t += $i + 0
        if (t == 0 || t == 5)
            next
    }
    print
}' file

This gives you the expected output. It effectively checks for "more than FOUR consecutive zeros/ones" rather than FIVE: each group has at most 5 columns, so "more than 5" can never happen within a group.
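The per-group sum rule can be tried on the two patterns from the comments on the question (00000 11111 00000 and 00011 11100 01010). What follows is only a minimal sketch under some assumptions: the function name `filter_groups` and the labels `x` and `y` are invented for illustration, awk's default whitespace splitting is used so tabs and spaces both work, and a line is dropped as soon as any one of the three groups is uniform.

```shell
# Sketch: keep a line only if none of its last three 5-field groups is
# all zeros (sum 0) or all ones (sum 5). Assumes at least 15 fields per line.
filter_groups() {
    awk '{
        first = NF - 14                    # first of the last fifteen fields
        for (g = 0; g < 3; g++) {
            sum = 0
            for (i = first + 5 * g; i < first + 5 * (g + 1); i++)
                sum += $i
            if (sum == 0 || sum == 5)
                next                       # uniform group: drop the line
        }
        print
    }'
}

# The labels x and y are made-up stand-ins for the leading text fields.
printf '%s\n' \
    'x 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0' \
    'y 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0' | filter_groups
```

Only the line labelled `y` is printed, because each of its groups sums to 2 or 3, while the first group of the `x` line sums to 0.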

7 Comments

If you divide the last fifteen fields into three, each of these groups can't have more than five ones or more than five zeros. Thank you.
@Criatos oh you just want to check each "group" in the last 15 fields? E.g. in this way (5 cols)(5 cols)(5 cols)?
That is what I want.
That's why including code in the question is important. :-)
@Criatos if you divide the 15 columns into 3 groups, how can any one group have more than five consecutive zeros or more than five consecutive ones? There are only 5 fields in each group.
0

awk 4

awk 'split($0,t,/(1 +){6,}|(0 +){6,}/)<2' file 

awk 3.1

awk --posix 'split($0,t,/(1 +){6,}|(0 +){6,}/)<2' file 

update

awk '{for(i=9;i<=NF;i++){a[$i];if(++c==5){l=length(a);delete a;c=0;if(l>1){print;break}}}}' file 

1 Comment

If you divide the last fifteen fields into three, each of these groups can't have more than five ones or more than five zeros. Thank you.
0

The following ERE in grep works with your input data, where ALL THREE groups of five have matching content:

egrep -v '(\s+[01])\1\1\1\1(\s+[01])\2\2\2\2(\s+[01])\3\3\3\3' file 

Since your question is tagged awk, though, let's express this in awk.

We can't do the same thing in awk, because awk traditionally does not support backreferences in regular expressions. So as your script suggests, doing this programmatically may be the answer. Your solution concatenates fields and compares strings. I think I would probably use arithmetic instead -- a sum of the five fields is a number from zero to five. A value of zero or five means "skip", anything else means "print".

#!/usr/bin/awk -f

{
  # Reset the per-record state so sums do not carry over between lines.
  delete sum;
  group=0;

  # Count back from the end in groups of five, until we hit a field
  # that is neither "0" nor "1"...
  start=NF;
  while (start >= 5 && $start ~ /^[01]$/) {
    group++;
    for(i=start;i>start-5;i--) {
      sum[group]+=$i;
    }
    start=i;
  }

  # Step through every group, adding a condition to a counter.
  # A sum strictly between 0 and 5 means the group mixes zeros and ones.
  # At the end of the loop, if found > 0, then we've found a line
  # that does not have the pattern specified.
  found=0;
  for (; group > 0; group--) {
    found+=(sum[group] > 0 && sum[group] < 5);
  }
}

# If found > 0, print the line.
found
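As an aside on why the string-comparison attempt in the question prints every line: this is a reading of the posted code, not something the asker confirmed, but `A`, `B` and `C` are used there as unquoted array subscripts. In awk, an unset variable evaluates to the empty string, so `num[A]`, `num[B]` and `num[C]` all refer to the same element `num[""]`, and the three loops build one fifteen-character string that can never equal "00000" or "11111". The sketch below shows the effect on made-up ten-field input; the function names are invented for the demonstration.

```shell
# Demo with made-up input: ten space-separated fields, two groups of five.
# A and B are unset awk variables, so num[A] and num[B] are both num[""].
demo_unquoted() {
    printf '1 1 1 1 1 0 0 0 0 0\n' | awk '{
        num[A] = ""; num[B] = ""
        for (i = 1; i <= 5; i++)  num[A] = num[A] $i
        for (i = 6; i <= 10; i++) num[B] = num[B] $i
        print num[A]              # holds both groups: A and B are the same key
    }'
}

# Quoting the subscripts makes them distinct string keys.
demo_quoted() {
    printf '1 1 1 1 1 0 0 0 0 0\n' | awk '{
        num["A"] = ""; num["B"] = ""
        for (i = 1; i <= 5; i++)  num["A"] = num["A"] $i
        for (i = 6; i <= 10; i++) num["B"] = num["B"] $i
        print num["A"]            # holds only the first group
    }'
}

demo_unquoted   # prints 1111100000
demo_quoted     # prints 11111
```

Writing `num["A"]`, or using three plain variables instead of an array, would be one way to keep the groups separate.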
