add and calculate percentage

Question

I have two columns of data. For each value in column A, I want to count the number of repeated records, count the number of tallies in column B, and then calculate the percentage of tallies. Example:

494 1
494
494
494 1
500
500 1
500
501
501
501 1
501

For 494, there are 4 records and 2 tallies, so I would like to calculate 2/4 = .50 and so on.

Is the input always sorted (or at least collated) on the first column (which is sometimes the only column)? Is the second column always either 1 or blank? Please do not respond in comments; edit your question to make it clearer and more complete.
– Scott - Слава Україні, Commented Oct 12, 2017 at 19:36
Great that you've detailed the input. Could you update the question to clearly show the desired output?
– steve, Commented Oct 12, 2017 at 21:14

hschou · Accepted Answer · 2017-10-15 08:18:30Z

As a one-liner this awk example is rather complicated.

{
    if (A!=$1) {         # This section has a different A-column
        if (a) {         # If a>0, then it is not the beginning
            print A,b/a  # Print result
        }
        A=$1;            # Re-init variables
        a=0;
        b=0
    }
    ++a;
    b += $2 ? 1 : 0
}

To run this, put the awk script in a file named frac-calc and the numbers in a file named number, then run it:

( cat number; echo ) | awk -E frac-calc

The output would be:

494 0.5
500 0.333333
501 0.25

The echo is needed because it appends an empty line to the input. Column A of that empty line differs from 501, which ensures that the result of the last block (501) is printed.
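The effect of the trailing empty line can be seen in a small experiment. This is only a sketch: group_no_end is a hypothetical wrapper name, the four-line sample is made up for illustration, and the awk program is the script above with the comments dropped.

```shell
# Hypothetical helper: the grouping logic shown above, without an END block.
group_no_end() {
    awk '{
        if (A != $1) {
            if (a) print A, b / a
            A = $1; a = 0; b = 0
        }
        ++a
        b += $2 ? 1 : 0
    }'
}

# Made-up sample: two records for 494 and two records for 500.
sample='494 1
494
500
500 1'

# Without a trailing empty line, the 500 group is counted but never printed.
printf '%s\n' "$sample" | group_no_end

# With the extra empty line, the change in column A flushes the 500 group too.
{ printf '%s\n' "$sample"; echo; } | group_no_end
```

The first command prints only the 494 line; the second prints both groups.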

It can also be a long one-liner:

( cat number; echo ) | awk '{if(A!=$1){if(a){print A,b/a}A=$1;a=0;b=0}++a;b+=$2?1:0}'

Edit: With the use of END and without echo as mentioned in the comments:

{
    if (A!=$1) {         # This section has a different A-column
        if (a) {         # If a>0, then it is not the beginning
            print A,b/a  # Print result
        }
        A=$1;            # Re-init variables
        a=0;
        b=0
    }
    ++a;
    b += $2 ? 1 : 0
}
END {
    print A,b/a          # Print result
}

And call it:

awk -E frac-calc number

The one-liner is then a bit longer:

awk '{if(A!=$1){if(a){print A,b/a}A=$1;a=0;b=0}++a;b+=$2?1:0}END{print A,b/a}' number

Because the last set (501) is not printed until a line with a different first column is read. One could also add an extra empty line to the file itself with echo >> number.
– hschou, Commented Oct 13, 2017 at 8:34

MiniMax · Accepted Answer · 2017-10-13 00:37:41Z

First version: uses a two-dimensional array.

gawk '
BEGIN {
    PROCINFO["sorted_in"] = "@ind_num_asc";
}
{
    arr[$1][0]++;
    arr[$1][1] += $2;
}
END {
    for (i in arr) {
        print i, arr[i][1] / arr[i][0];
    }
}' input.txt

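The PROCINFO["sorted_in"] = "@ind_num_asc"; line is explained in the gawk manual, in the section "Using Predefined Array Scanning Orders". It makes for (i in arr) visit the indices in ascending numeric order.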

In this case, it can be replaced by piping the gawk output to the sort -n command:

gawk '
{
    arr[$1][0]++;
    arr[$1][1] += $2;
}
END {
    for (i in arr) {
        print i, arr[i][1] / arr[i][0];
    }
}' input.txt | sort -n
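If gawk's arrays of arrays are not available, the same idea can be sketched with two ordinary one-dimensional arrays. This is a swapped-in variant, not the code from this answer: tally_fraction is a hypothetical function name, and the sketch assumes the input contains no empty lines.

```shell
# Hypothetical helper: per-key fraction of tallies, using two plain awk arrays.
# Reads the files given as arguments, or standard input if there are none.
tally_fraction() {
    awk '
        { count[$1]++; tallies[$1] += $2 }     # a blank second column adds 0
        END {
            # for (k in count) visits keys in an unspecified order,
            # so the output is sorted numerically afterwards.
            for (k in count) print k, tallies[k] / count[k]
        }
    ' "$@" | sort -n
}

# Usage with deliberately unsorted input: records of one key need not be adjacent.
printf '500\n494 1\n494\n500 1\n500\n494\n494 1\n' | tally_fraction
```

Because the counts are kept per key until the end, this form does not depend on the input being sorted or grouped, at the cost of holding one entry per distinct key in memory.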

Second version: a more efficient variant without an array. It assumes that records with the same first column are adjacent in the input.

gawk '
NR == 1 {
    record = $1;
}
record != $1 {
    print record, tallies / cnt;
    record = $1;
    cnt = 0;
    tallies = 0;
}
{
    cnt++;
    tallies += $2;
}
END {
    print record, tallies / cnt;
}' input.txt

Output:

494 0.5
500 0.333333
501 0.25
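Since the question asks for a percentage, the streaming approach can also be sketched with printf formatting. This is an illustrative variant under stated assumptions, not part of the answer above: tally_percent is a hypothetical function name, the one-decimal format is an arbitrary choice, and the input is assumed to be grouped on the first column.

```shell
# Hypothetical helper: streaming per-group percentage, no arrays.
# Assumes records with the same first column are adjacent.
tally_percent() {
    awk '
        # A new key starts: print the finished group and reset the counters.
        NR > 1 && $1 != key {
            printf "%s %.1f%%\n", key, 100 * tallies / cnt
            cnt = 0; tallies = 0
        }
        { key = $1; cnt++; tallies += $2 }
        # Print the last group; the guard avoids dividing by zero on empty input.
        END { if (cnt) printf "%s %.1f%%\n", key, 100 * tallies / cnt }
    ' "$@"
}

# Usage: the 494 and 500 groups from the question.
printf '494 1\n494\n494\n494 1\n500\n500 1\n500\n' | tally_percent
```

For the 494 group this gives 100 * 2 / 4, printed as 50.0%.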
