1

I have two columns of data. I want to count the number of repeating records in column A, count the number of tallies in column B for each of them, and then calculate the percentage of tallies. Example:

494 1
494
494
494 1
500
500 1
500
501
501
501 1
501

For 494, there are 4 records and 2 tallies, so I would like to calculate 2/4 = .50 and so on.

2
  • 2
    Is the input always sorted (or at least collated) on the first column (which is sometimes the only column)?  Is the second column always either 1 or blank?  Please do not respond in comments; edit your question to make it clearer and more complete. Commented Oct 12, 2017 at 19:36
  • 1
    Great that you've detailed the input. Could you update the question to clearly show desired output ? Commented Oct 12, 2017 at 21:14

2 Answers

1

This awk example is rather complicated as a one-liner, so here it is first as a commented script:

{
    if (A != $1) {          # This section has a different A-column
        if (a) {            # If a>0, then it is not the beginning
            print A, b/a    # Print result
        }
        A = $1;             # Re-init variables
        a = 0;
        b = 0
    }
    ++a;
    b += $2 ? 1 : 0
}

To run this, put the awk script in a file named frac-calc and the numbers in a file named number, then run:

( cat number; echo ) | awk -f frac-calc

The output would be:

494 0.5
500 0.333333
501 0.25

The echo is needed because a group is only printed when column A changes. The empty line it appends has a column A that differs from 501, which ensures that the result of the last block (501) is printed.

It can also be a long one-liner:

( cat number; echo ) | awk '{if(A!=$1){if(a){print A,b/a}A=$1;a=0;b=0}++a;b+=$2?1:0}' 
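The effect of the appended empty line can be checked with a small sketch. It assumes the sample data from the question is written to a temporary file (the names tmp and prog are made up for this illustration) and runs the same program with and without the extra echo:

```shell
# Illustration only: tmp and prog are hypothetical names, not part of the answer.
tmp=$(mktemp)
printf '%s\n' '494 1' 494 494 '494 1' 500 '500 1' 500 501 501 '501 1' 501 > "$tmp"

# Same logic as the script above.
prog='{ if (A != $1) { if (a) { print A, b/a } A = $1; a = 0; b = 0 } ++a; b += $2 ? 1 : 0 }'

# Without the extra line, the last group (501) is never printed.
without_echo=$(awk "$prog" "$tmp")

# With the extra empty line, column A changes once more and 501 is flushed.
with_echo=$( (cat "$tmp"; echo) | awk "$prog" )

printf 'without echo:\n%s\n\nwith echo:\n%s\n' "$without_echo" "$with_echo"
rm -f "$tmp"
```

With the sample data, the first run should stop after the 500 line, while the second run also prints the 501 line.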

Edit: With the use of END and without echo as mentioned in the comments:

{
    if (A != $1) {          # This section has a different A-column
        if (a) {            # If a>0, then it is not the beginning
            print A, b/a    # Print result
        }
        A = $1;             # Re-init variables
        a = 0;
        b = 0
    }
    ++a;
    b += $2 ? 1 : 0
}
END {
    print A, b/a            # Print result
}

And call it:

awk -f frac-calc number

The one-liner is then a bit longer:

awk '{if(A!=$1){if(a){print A,b/a}A=$1;a=0;b=0}++a;b+=$2?1:0}END{print A,b/a}' number 
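One edge case is worth a note: if the input is empty, a is still 0 when END runs, and b/a divides by zero (gawk, for instance, aborts with a division-by-zero error). A minimal sketch of a guarded variant follows. The function name group_fraction is hypothetical, and skipping blank lines is an added assumption that is not part of the answer above:

```shell
# Hypothetical wrapper around a guarded version of the END-based script.
group_fraction() {
    awk '
        !NF { next }                  # assumption: ignore blank lines
        !a || $1 != A {               # first record, or column A changed
            if (a) print A, b / a     # flush the previous group
            A = $1; a = 0; b = 0
        }
        { ++a; b += $2 ? 1 : 0 }      # count the record and its tally
        END { if (a) print A, b / a } # print only if a group was seen
    ' "$@"
}

# Usage: reads the named files, or standard input when none are given.
printf '%s\n' '494 1' 494 494 '494 1' 500 '500 1' 500 | group_fraction
```

Like the original, this still relies on the input being grouped by column A.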
3
  • Why the cat and echo? Why not just awk ... number? Commented Oct 13, 2017 at 8:17
  • Because the last set (501) will not get calculated until a line arrives whose first column is not 501. One could also add an extra line to the file with echo >> number. Commented Oct 13, 2017 at 8:34
  • 1
    That's what END is for. Commented Oct 14, 2017 at 21:29
0

First version: a two-dimensional array (a gawk array of arrays) is used.

gawk '
BEGIN {
    PROCINFO["sorted_in"] = "@ind_num_asc";
}
{
    arr[$1][0]++;
    arr[$1][1] += $2;
}
END {
    for (i in arr) {
        print i, arr[i][1] / arr[i][0];
    }
}' input.txt

The PROCINFO["sorted_in"] = "@ind_num_asc"; line makes the for (i in arr) loop visit the indices in ascending numeric order. It is explained in the gawk manual under Using Predefined Array Scanning Orders.

In this case, it can be replaced by piping the gawk output to the sort -n command:

gawk '
{
    arr[$1][0]++;
    arr[$1][1] += $2;
}
END {
    for (i in arr) {
        print i, arr[i][1] / arr[i][0];
    }
}' input.txt | sort -n

Second version: a more efficient variant without an array. It relies on the input being grouped by the first column.

gawk '
NR == 1 {
    record = $1;
}
record != $1 {
    print record, tallies / cnt;
    record = $1;
    cnt = 0;
    tallies = 0;
}
{
    cnt++;
    tallies += $2;
}
END {
    print record, tallies / cnt;
}' input.txt

Output:

494 0.5
500 0.333333
501 0.25
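If gawk is not available, the array approach can probably be approximated with two ordinary one-dimensional arrays, which any POSIX awk supports. The following is only a sketch under assumptions: the function name tally_ratio is hypothetical, and any non-zero, non-empty second column is counted as one tally, as in the other answer. Because it collects everything before printing, the input does not need to be grouped:

```shell
# Hypothetical helper: fraction of tallied records per key, for unsorted input.
tally_ratio() {
    awk '
        NF {                              # skip blank lines
            cnt[$1]++                     # records seen for this key
            tal[$1] += ($2 ? 1 : 0)       # tallies seen for this key
        }
        END {
            for (k in cnt) print k, tal[k] / cnt[k]
        }
    ' "$@" | sort -n                      # for (k in ...) order is unspecified
}

# Usage with shuffled sample data on standard input.
printf '%s\n' 501 '494 1' 500 494 '501 1' '500 1' 494 500 '494 1' 501 501 | tally_ratio
```

The trade-off is the same as in the first version: memory use grows with the number of distinct keys, whereas the second version only keeps one group at a time.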
