Taking cues from Bruce's code, here is a more efficient implementation that does not keep all of the data in memory.  As stated in the question, it assumes that the input file has (at most) one number per line.  It counts the lines in the input file that contain a qualifying number and passes that count, followed by the sorted data, to the awk command.  So, for example, if the file contains

6.0 4.2 8.3 9.5 1.7 

then the input to awk is actually

5 1.7 4.2 6.0 8.3 9.5 
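You can reproduce this count-prefixed stream by hand as a quick sanity check. A minimal sketch, assuming a scratch file nums.txt and using grep -c . as a shortcut that counts non-empty lines (the full command below instead counts only lines matching the numeric pattern):

printf '%s\n' 6.0 4.2 8.3 9.5 1.7 > nums.txt
(grep -c . nums.txt; sort -n nums.txt)
# prints 5, then 1.7 4.2 6.0 8.3 9.5, one value per line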

Then the awk script captures the data count in the NR==1 code block and saves the middle value (or the two middle values, which are averaged to yield the median) when it sees them.

FILENAME="Salaries.csv" (awk 'BEGIN {c=0} $1 ~ /^[-0-9]9]*(\.[0-9]*)?$/ {c=c+1;} END {print c;}' "$FILENAME"; \ sort -n "$FILENAME") | awk ' BEGIN { c = 0 sum = 0 med1_loc = 0 med2_loc = 0 med1_val = 0 med2_val = 0 min = 0 max = 0 } NR==1 { LINES = $1 # We check whether numlines is even or odd so that we keep only # the locations in the array where the median might be. if (LINES%2==0) {med1_loc = LINES/2-1; med2_loc = med1_loc+1;} if (LINES%2!=0) {med1_loc = med2_loc = (LINES-1)/2;} } $1 ~ /^[-0-9]*(\.[0-9]*)?$/ && NR!=1 { # setting min value if (c==0) {min = $1;} # middle two values in array if (c==med1_loc) {med1_val = $1;} if (c==med2_loc) {med2_val = $1;} c++ sum += $1 max = $1 } END { ave = sum / c median = (med1_val + med2_val ) / 2 print "sum:" sum print "count:" c print "mean:" ave print "median:" median print "min:" min print "max:" max } ' 
