Taking cues from Bruce's code, here is a more efficient implementation that does not keep all of the data in memory.  As stated in the question, it assumes that the input file has (at most) one number per line.  It counts the lines in the input file that contain a qualifying number and passes that count, followed by the sorted data, to the awk command.  So, for example, if the file contains

6.0 4.2 8.3 9.5 1.7 

then the input to awk is actually

5 1.7 4.2 6.0 8.3 9.5 
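You can reproduce this count-prefixed stream by hand as a quick sanity check. A minimal sketch, assuming a scratch file nums.txt and using grep -c . as a shortcut that counts non-empty lines (the full command below instead counts only lines matching the numeric pattern):

printf '%s\n' 6.0 4.2 8.3 9.5 1.7 > nums.txt
(grep -c . nums.txt; sort -n nums.txt)
# prints 5, then 1.7 4.2 6.0 8.3 9.5, one value per line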

Then the awk script captures the data count in the NR==1 code block and saves the middle value (or the two middle values, which are averaged to yield the median) when it sees them.

FILENAME="Salaries.csv" (awk 'BEGIN {c=0} $1 ~ /^[-0-9]9]*(\.[0-9]*)?$/ {c=c+1;} END {print c;}' "$FILENAME"; \ sort -n "$FILENAME") | awk ' BEGIN { c = 0 sum = 0 med1_loc = 0 med2_loc = 0 med1_val = 0 med2_val = 0 min = 0 max = 0 } NR==1 { LINES = $1 # We check whether numlines is even or odd so that we keep only # the locations in the array where the median might be. if (LINES%2==0) {med1_loc = LINES/2-1; med2_loc = med1_loc+1;} if (LINES%2!=0) {med1_loc = med2_loc = (LINES-1)/2;} } $1 ~ /^[-0-9]*(\.[0-9]*)?$/ && NR!=1 { # setting min value if (c==0) {min = $1;} # middle two values in array if (c==med1_loc) {med1_val = $1;} if (c==med2_loc) {med2_val = $1;} c++ sum += $1 max = $1 } END { ave = sum / c median = (med1_val + med2_val ) / 2 print "sum:" sum print "count:" c print "mean:" ave print "median:" median print "min:" min print "max:" max } ' 
