edited Oct 11, 2015 at 4:33

Taking cues from Bruce's code, here is a more efficient implementation in which awk does not hold the whole data set in memory. As stated in the question, it assumes that the input file has (at most) one number per line. It counts the lines in the input file that contain a qualifying number and passes that count to the awk command ahead of the sorted data. So, for example, if the file contains

    6.0
    4.2
    8.3
    9.5
    1.7

then the input to awk is actually

    5
    1.7
    4.2
    6.0
    8.3
    9.5
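The combined stream can be reproduced on its own with a minimal sketch like the following. The temporary file and the variable names `sample` and `combined` are assumptions made for illustration, and the count is taken with a plain `NR` rather than the numeric filter used in the full script.

```shell
# Hypothetical sample file holding the five numbers from the example.
sample=$(mktemp)
printf '%s\n' 6.0 4.2 8.3 9.5 1.7 > "$sample"

# The subshell runs two commands in sequence: the first prints the line
# count, the second prints the numerically sorted data. Their outputs are
# concatenated into one stream, count first.
combined=$( (awk 'END { print NR }' "$sample"; LC_ALL=C sort -n "$sample") )

printf '%s\n' "$combined"
rm -f "$sample"
```

The count has to come first because the consuming awk script needs it before it sees any data, which is why the file is read twice.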

Then the awk script captures the data count in the NR==1 code block and saves the middle value (or the two middle values, which are averaged to yield the median) when it sees them.
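Stripped of the other statistics, the median step might look like the sketch below. The function name `median_of_sorted` and the variables `lo` and `hi` are hypothetical, and the positions are computed with `int()`, which is an equivalent formulation of the even/odd test rather than the exact code of this answer.

```shell
# Reads a count on line 1 followed by that many sorted numbers,
# and prints the median without storing the values.
median_of_sorted() {
    LC_ALL=C awk '
        NR == 1 {
            n  = $1
            lo = int((n - 1) / 2)   # zero-based position of the lower middle value
            hi = int(n / 2)         # zero-based position of the upper middle value
            next
        }
        { i = NR - 2 }              # zero-based position of the current value
        i == lo { a = $1 }
        i == hi { b = $1 }
        END { print (a + b) / 2 }   # lo == hi when n is odd, so this is just a
    '
}

odd=$(printf '%s\n' 5 1.7 4.2 6.0 8.3 9.5 | median_of_sorted)
even=$(printf '%s\n' 4 1 2 3 4 | median_of_sorted)
printf 'odd: %s\neven: %s\n' "$odd" "$even"
```

For the five-value example this prints a median of 6, and for the four values 1 to 4 it prints 2.5.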

    FILENAME="Salaries.csv"

    # The pattern requires at least one digit, so blank lines are not counted.
    (awk 'BEGIN {c=0} $1 ~ /^-?([0-9]+\.?[0-9]*|\.[0-9]+)$/ {c=c+1;} END {print c;}' "$FILENAME"; \
     sort -n "$FILENAME") | awk '
      BEGIN {
        c = 0
        sum = 0
        med1_loc = 0
        med2_loc = 0
        med1_val = 0
        med2_val = 0
        min = 0
        max = 0
      }

      NR==1 {
        LINES = $1
        # We check whether the count is even or odd so that we keep only
        # the positions in the sorted data where the median might be.
        if (LINES%2==0) {med1_loc = LINES/2-1; med2_loc = med1_loc+1;}
        if (LINES%2!=0) {med1_loc = med2_loc = (LINES-1)/2;}
      }

      $1 ~ /^-?([0-9]+\.?[0-9]*|\.[0-9]+)$/  &&  NR!=1 {
        # setting min value (the data is sorted, so the first value is the smallest)
        if (c==0) {min = $1;}
        # middle two values of the sorted data
        if (c==med1_loc) {med1_val = $1;}
        if (c==med2_loc) {med2_val = $1;}
        c++
        sum += $1
        # the last value seen is the largest
        max = $1
      }

      END {
        # avoid dividing by zero when the file has no numeric lines
        if (c == 0) { print "count:0"; exit }
        ave = sum / c
        median = (med1_val + med2_val) / 2
        print "sum:" sum
        print "count:" c
        print "mean:" ave
        print "median:" median
        print "min:" min
        print "max:" max
      }
    '

answered Oct 10, 2015 at 1:44

Rahul Agarwal