I have the following table:

    CHR BP SNP CM AN1
    1 15558213 rs2845371 0 -1.10837716961610
    1 15558230 rs16981507 0 -1.13721847993853
    1 15558586 rs5993924 0 -1.34239265871644
    1 15563103 rs3016111 0 -1.61194237184708

I would like to select the highest 2% of the values in column 5, and replace each value with 1 if it is in that top 2% and with 0 otherwise.

I figured that I need to use the if...else command. However, I don't know how to define the first line (if col5= top2%)

    if col5= top2%
    then
        awk '{$5=1 ; print ;}' file
    else
        awk '{$5=0 ; print ;}' $file
    fi

I would be very grateful if you can direct me to the way to solve this.

  • Welcome, could you explain what you mean exactly with "top 2%"? Commented Feb 25, 2021 at 22:46
  • Sure! I want to select the highest values of all in column 5, and specifically the highest 2%. For example, we have these values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. 10 is in the top 10% of all values, so if my criteria would be 10%, I select 10 and write 1 instead of the actual number. Hope I explained it clearly! Commented Feb 25, 2021 at 22:50
  • I am not skilled enough with awk to do this in one pass, but I am guessing that you do not know the range of values before you begin (e.g. you want the top 2%, but you do not know "2% of what value" beforehand). You can find the top value with awk using if ($5 > highest) highest = $5, for example; this finds the max of $5. You could find the max and the next-highest value, but you want a percentage. Again, do you know the total range of $5 beforehand? If not, you need to make two passes: one to find the range of $5, after which you can calculate 2% and use awk as in the example. Commented Feb 25, 2021 at 23:36
  • I do not know the total range unfortunately. Great suggestion, I will try it, thanks! Commented Feb 25, 2021 at 23:42
  • Is there a reason we could not just put them in order, highest to lowest, and then take the top 1 or 2? Commented Feb 25, 2021 at 23:49

2 Answers

    awk '
        PASS==1{
            if (FNR==2){ min=max=$5; next }
            min=($5 < min ? $5 : min)
            max=($5 > max ? $5 : max)
            next
        }
        FNR==1{ threshold=(max - ((max - min) / 50)) }
        FNR>1 { $5=($5 >= threshold) }
        1
    ' PASS=1 file PASS=2 file

Read the input file in two passes.

First pass: Determine min and max value of the 5th field.
Second pass: On the first record, compute the threshold for the top 2% as max minus 2% of the range (max - min). On every other record, set the 5th field to 1 if it is greater than or equal to the threshold and to 0 otherwise. Then print the record.

Output:

    CHR BP SNP CM AN1
    1 15558213 rs2845371 0 1
    1 15558230 rs16981507 0 0
    1 15558586 rs5993924 0 0
    1 15563103 rs3016111 0 0
  • Great idea. Although using this approach I get 0 for each value in $5, no 1-s at all. Commented Feb 26, 2021 at 17:35
  • I get a threshold value of -1.11845 and the first value is greater. Commented Feb 26, 2021 at 18:50
  • Thanks, it's working now, I just sorted the values from the highest to the lowest first. Commented Feb 26, 2021 at 20:21
  • Sorry to bother you again, but can you direct me how to do this if instead of 2% I want 1% and 0.5% of top values? Thank you so much! Commented Feb 27, 2021 at 13:40
  • Divide by 50 for 2% (100/2), 100 for 1% (100/1), 200 for 0.5% (100/0.5) ... Commented Feb 27, 2021 at 14:26
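The divisor rule in the comment above can be folded into a single percentage parameter. The following is a minimal sketch rather than the answerer's exact script: it assumes a whitespace-separated file with exactly one header line, and the names `flag_top_range`, `pct`, and `sample.txt` are hypothetical, introduced here only for illustration. Like the answer, it puts the threshold at max minus pct% of the range (max - min), so it flags values by their distance from the maximum, not by row count.

```shell
#!/bin/sh
# Hypothetical sample file holding the rows from the question.
cat > sample.txt <<'EOF'
CHR BP SNP CM AN1
1 15558213 rs2845371 0 -1.10837716961610
1 15558230 rs16981507 0 -1.13721847993853
1 15558586 rs5993924 0 -1.34239265871644
1 15563103 rs3016111 0 -1.61194237184708
EOF

# flag_top_range is a hypothetical helper name.
# $1 = percentage of the value range, $2 = input file (read twice).
flag_top_range() {
    awk -v pct="$1" '
        # Pass 1 (first copy of the file): track min and max of column 5.
        NR == FNR {
            if (FNR == 1) next              # skip the header line
            v = $5 + 0                      # force a numeric value
            if (FNR == 2 || v < min) min = v
            if (FNR == 2 || v > max) max = v
            next
        }
        # Pass 2: compute the threshold once, print the header unchanged.
        FNR == 1 { threshold = max - (max - min) * pct / 100; print; next }
        # Replace column 5 with 1 (at or above the threshold) or 0.
        { $5 = ($5 + 0 >= threshold); print }
    ' "$2" "$2"
}

flag_top_range 2 sample.txt
```

With `pct` set to 2 this is the same threshold as dividing the range by 50; `flag_top_range 1 sample.txt` and `flag_top_range 0.5 sample.txt` correspond to dividing by 100 and 200.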
Sort it first perhaps:

    sort -k5,5 -n -r FileName.txt | head
  • Thank you for your contribution. Please note that your answer does not address the "write 1 if it is true and 0 when it is false" part of the OP's question; you may want to expand it in that respect. Commented Feb 26, 2021 at 8:18
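One possible way to extend the sort idea so that it also writes 1 and 0 is sketched below. It is an illustration under assumptions, not something given in the thread: the file is assumed to be whitespace-separated with one header line, `flag_top_count` and `ranks.txt` are hypothetical names, and rounding the row count up (with a minimum of one row) is a choice made here. It selects by row count, as in the asker's 1 to 10 example, and rows that tie with the cutoff value are all flagged.

```shell
#!/bin/sh
# Hypothetical test file: column 5 holds the values 1..10 in mixed order.
{
    echo "CHR BP SNP CM AN1"
    for i in 3 10 1 7 5 2 9 4 8 6; do
        echo "1 $i rs$i 0 $i"
    done
} > ranks.txt

# flag_top_count is a hypothetical helper name.
# $1 = percentage of rows to flag, $2 = input file with one header line.
flag_top_count() {
    # Sort column 5 from highest to lowest, then pick the value at
    # position ceil(rows * pct / 100); that value becomes the cutoff.
    cutoff=$(awk 'NR > 1 { print $5 }' "$2" | LC_ALL=C sort -n -r |
        awk -v p="$1" '
            { v[NR] = $1 }
            END {
                k = NR * p / 100
                c = int(k); if (c < k) c++   # round up (assumed rule)
                if (c < 1) c = 1             # always flag at least one row
                print v[c]
            }')
    # Keep the original row order; rewrite column 5 as 1 or 0.
    awk -v cut="$cutoff" '
        NR == 1 { print; next }              # header unchanged
        { $5 = ($5 + 0 >= cut + 0); print }
    ' "$2"
}

flag_top_count 10 ranks.txt
```

Because only the cutoff value comes from the sorted stream, the output keeps the rows in their original order, which avoids having to sort the file before flagging.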
