
I have a big data file containing various kinds of information. I need to select some rows of this file and copy them into another one.

my_file.txt (columns are separated by tabs; I have reported only the first columns, but after them there is other information)

There are 2543 rows and 22 columns.

4gga_A_001_______________ clust_001 APC-coactivator_clust_001 4GGA-A Q12834 2.04 CDC20 APC-coactivator
4ggc_A_002_______________ clust_001 APC-coactivator_clust_001 4GGC-A Q12834 1.35 CDC20 APC-coactivator
4ggd_A_002_______________ clust_001 APC-coactivator_clust_001 4GGD-A Q12834 2.43 CDC20 APC-coactivator
4n14_A_002_______________ clust_001 APC-coactivator_clust_001 4N14-A Q12834 2.1 CDC20 APC-coactivator
5g04_R_002_______________ clust_001 APC-coactivator_clust_001 5G04-R Q12834 3.9 CDC20 APC-coactivator
5khu_R_006_______________ clust_001 APC-coactivator_clust_001 5KHU-R Q12834 4.8 CDC20 APC-coactivator
5lcw_Q_002_______________ clust_001 APC-coactivator_clust_001 5LCW-Q Q12834 4.2 CDC20 APC-coactivator
6q6g_R_004_______________ clust_001 APC-coactivator_clust_001 6Q6G-R Q12834 3.2 CDC20 APC-coactivator
6q6h_R_003_______________ clust_001 APC-coactivator_clust_001 6Q6H-R Q12834 3.2 CDC20 APC-coactivator
6q6g_R_005_______________ clust_016 APC-coactivator_clust_016 6Q6G-R Q12834 3.2 CDC20 APC-coactivator
6q6h_R_002_______________ clust_017 APC-coactivator_clust_017 6Q6H-R Q12834 3.2 CDC20 APC-coactivator
1u6d_X_001_______________ clust_001 BTB_clust_001 1u6d_X Q14145 1.85 KEAP1 BTB
1zgk_A_001_______________ clust_001 BTB_clust_001 1zgk_A Q14145 1.35 KEAP1 BTB
2vpj_A_001_______________ clust_001 BTB_clust_001 2vpj_A Q53G59 1.85 KLHL12 BTB
2xn4_A_001_______________ clust_001 BTB_clust_001 2xn4_A O95198 1.99 KLHL2 BTB
3vng_A_001_______________ clust_001 BTB_clust_001 3vng_A Q14145 2.1 KEAP1 BTB
3vnh_A_001_______________ clust_001 BTB_clust_001 3vnh_A Q14145 2.1 KEAP1 BTB
3zgc_A_001_______________ clust_001 BTB_clust_001 3zgc_A Q14145 2.2 KEAP1 BTB
3zgd_A_001_______________ clust_001 BTB_clust_001 3zgd_A Q14145 1.98 KEAP1 BTB
4ch9_A_001_______________ clust_001 BTB_clust_001 4ch9_A Q9UH77 1.84 KLHL3 BTB
4chb_A_001_______________ clust_001 BTB_clust_001 4chb_A O95198 1.56 KLHL2 BTB
4ifj_A_001_______________ clust_001 BTB_clust_001 4ifj_A Q14145 1.8 KEAP1 BTB
4ifl_X_001_______________ clust_001 BTB_clust_001 4ifl_X Q14145 1.8 KEAP1 BTB
4ifn_X_001_______________ clust_001 BTB_clust_001 4ifn_X Q14145 2.4 KEAP1 BTB
4in4_A_001_______________ clust_001 BTB_clust_001 4in4_A Q14145 2.59 KEAP1 BTB
4iqk_A_001_______________ clust_001 BTB_clust_001 4iqk_A Q14145 1.97 KEAP1 BTB
4l7b_A_001_______________ clust_001 BTB_clust_001 4l7b_A Q14145 2.41 KEAP1 BTB
4l7b_B_001_______________ clust_001 BTB_clust_001 4l7b_B Q14145 2.41 KEAP1 BTB
4l7c_A_001_______________ clust_001 BTB_clust_001 4l7c_A Q14145 2.4 KEAP1 BTB
4l7d_A_001_______________ clust_001 BTB_clust_001 4l7d_A Q14145 2.25 KEAP1 BTB
4n1b_A_001_______________ clust_001 BTB_clust_001 4n1b_A Q14145 2.55 KEAP1 BTB
4xmb_A_001_______________ clust_001 BTB_clust_001 4xmb_A Q14145 2.43 KEAP1 BTB
5f72_C_001_______________ clust_001 BTB_clust_001 5f72_C Q14145 1.85 KEAP1 BTB
5nkp_A_001_______________ clust_001 BTB_clust_001 5nkp_A Q9UH77 2.8 KLHL3 BTB
5wfl_A_001_______________ clust_001 BTB_clust_001 5wfl_A Q14145 1.93 KEAP1 BTB
5wfv_A_001_______________ clust_001 BTB_clust_001 5wfv_A Q14145 1.91 KEAP1 BTB
5wg1_A_002_______________ clust_001 BTB_clust_001 5wg1_A Q14145 2.02 KEAP1 BTB
5whl_A_002_______________ clust_001 BTB_clust_001 5whl_A Q14145 2.5 KEAP1 BTB
5whl_B_001_______________ clust_001 BTB_clust_001 5whl_B Q14145 2.5 KEAP1 BTB
5who_A_002_______________ clust_001 BTB_clust_001 5who_A Q14145 2.23 KEAP1 BTB
5who_B_001_______________ clust_001 BTB_clust_001 5who_B Q14145 2.23 KEAP1 BTB
5wiy_A_001_______________ clust_001 BTB_clust_001 5wiy_A Q14145 2.23 KEAP1 BTB
5wiy_B_001_______________ clust_001 BTB_clust_001 5wiy_B Q14145 2.23 KEAP1 BTB
5x54_A_001_______________ clust_001 BTB_clust_001 5x54_A Q14145 2.3 KEAP1 BTB
5yq4_A_001_______________ clust_001 BTB_clust_001 5yq4_A Q9Y2M5 1.58 KLHL20 BTB
5yy8_A_001_______________ clust_001 BTB_clust_001 5yy8_A Q9Y6Y0 1.98 IVNS1ABP BTB
6fmp_A_001_______________ clust_001 BTB_clust_001 6fmp_A Q14145 2.92 KEAP1 BTB
6fmq_A_001_______________ clust_001 BTB_clust_001 6fmq_A Q14145 2.1 KEAP1 BTB
6gy5_A_001_______________ clust_001 BTB_clust_001 6gy5_A Q9Y2M5 1.09 KLHL20 BTB
6hws_A_001_______________ clust_001 BTB_clust_001 6hws_A Q14145 1.75 KEAP1 BTB
6n3h_A_001_______________ clust_001 BTB_clust_001 6n3h_A Q9Y6Y0 2.6 IVNS1ABP BTB
6rog_A_001_______________ clust_001 BTB_clust_001 6rog_A Q14145 2.16 KEAP1 BTB

I need to extract rows using the values in the 3rd, 5th and 6th columns. Specifically, for each group of rows that share the same third-column string (e.g. APC-coactivator_clust_001, or APC-coactivator_clust_016, ...), I must extract, for each distinct fifth-column value (e.g. Q12834, ...), the row with the lowest sixth-column value. I hope that is clear enough; in any case, here is the output file that I should get.

output.txt

4ggc_A_002_______________ clust_001 APC-coactivator_clust_001 4GGC-A Q12834 1.35 CDC20 APC-coactivator
6q6g_R_005_______________ clust_016 APC-coactivator_clust_016 6Q6G-R Q12834 3.2 CDC20 APC-coactivator
6q6h_R_002_______________ clust_017 APC-coactivator_clust_017 6Q6H-R Q12834 3.2 CDC20 APC-coactivator
1zgk_A_001_______________ clust_001 BTB_clust_001 1zgk_A Q14145 1.35 KEAP1 BTB
2vpj_A_001_______________ clust_001 BTB_clust_001 2vpj_A Q53G59 1.85 KLHL12 BTB
4chb_A_001_______________ clust_001 BTB_clust_001 4chb_A O95198 1.56 KLHL2 BTB
4ch9_A_001_______________ clust_001 BTB_clust_001 4ch9_A Q9UH77 1.84 KLHL3 BTB
5yy8_A_001_______________ clust_001 BTB_clust_001 5yy8_A Q9Y6Y0 1.98 IVNS1ABP BTB
6gy5_A_001_______________ clust_001 BTB_clust_001 6gy5_A Q9Y2M5 1.09 KLHL20 BTB

4 Answers


Using awk and processing the input file only once:

awk 'min[$3, $5]!=""{ if(min[$3, $5]>$6){ line[$3, $5]=$0; min[$3, $5]=$6}; next } { min[$3, $5]=$6; line[$3, $5]=$0 } END{ for(x in line) print line[x] }' infile 
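For readability, the same one-pass logic can be written as a commented script (a functionally equivalent sketch; the file name minrows.awk is just an example, run it with awk -f minrows.awk infile):

# For each (column 3, column 5) pair, remember the line with the
# smallest column-6 value seen so far.
min[$3, $5] != "" {              # pair already seen
    if (min[$3, $5] > $6) {      # strictly smaller value: replace stored line
        line[$3, $5] = $0
        min[$3, $5] = $6
    }
    next
}
{                                # first occurrence of this pair
    min[$3, $5] = $6
    line[$3, $5] = $0
}
END {                            # print one remembered line per pair
    for (x in line) print line[x]
}

Note that for (x in line) visits the keys in an unspecified order, so the output is not guaranteed to follow the input order.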

To "keep lines with equal minimum values" in 6th column:

awk 'min[$3, $5]!=""{ if(min[$3, $5] >$6){ line[$3, $5]=$0; min[$3, $5]=$6 } else if(min[$3, $5]==$6){ line[$3, $5]=line[$3, $5] ORS $0 }; next } { min[$3, $5]=$6; line[$3, $5]=$0 } END{ for(x in line) print line[x] }' infile 

With awk

FNR==NR && !seen[$3,$5]++ {val[$3,$5]=$6} FNR==NR && seen[$3,$5] {if ($6<val[$3,$5]) {val[$3,$5]=$6} } NR!=FNR && val[$3,$5]==$6 

Run with

awk -f script.awk input input 

What does it do?

Create a pseudo-multidimensional array using columns 3 and 5 as indices, and:

  1. if there is no such element yet, store the value of column 6;
  2. if the element already exists, compare it with column 6 and keep the smaller value;
  3. then run through the file again and print each line where the array element indexed by columns 3 and 5 equals the value of column 6.

It runs through the file twice, but with very low RAM usage. The output order is the same as in the input file.
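For example, a hypothetical invocation using the file names from the question:

# FNR restarts at 1 for each input file while NR keeps counting,
# so FNR==NR is true only during the first pass over the file.
awk -f script.awk my_file.txt my_file.txt > output.txt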

sort -t$'\t' -k3,3 -k5,5 -k6n,6 file | awk -F\\t '!seen[$3,$5]++' 

The main thing sort is used for is the numeric sorting of field 6 - the following would also work:

sort -t$'\t' -k6n,6 file | awk -F\\t '!seen[$3,$5]++' 

However, the output would not be grouped by columns 3 and 5. awk is used to print the first line for each unique column 3/5 pair. "$(printf '\t')" may be used in place of $'\t' in a shell that doesn't support $'...' C strings.
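For example, the second pipeline in that portable form would be:

sort -t"$(printf '\t')" -k6n,6 file | awk -F'\t' '!seen[$3,$5]++' 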

Using awk to process the file twice, keeping the same order as the input and also keeping lines with equal minimum values:

awk ' FNR==NR {if (min[$3,$5]=="" || $6<min[$3,$5]) min[$3,$5]=$6; next} $6==min[$3,$5] ' file file 

The sorted output comes out in a different order than your suggested output, so if the order is not critical, this works:

sort -s -k3,3 -k5,5 -k6,6n < in | perl -ane 'print unless $seen{$F[2]}{$F[4]}++' > out 

If the original order is to be maintained, you can run the following; nl prepends line numbers (shifting every field index by one), the final sort -k1,1n restores the input order, and cut -f2- removes the numbering:

nl < in | sort -s -k4,4 -k6,6 -k7,7n | perl -ane 'print unless $seen{$F[3]}{$F[5]}++' | sort -k1,1n | cut -f2- > out 

However, even your sample output does not preserve the original order (grep for 4ch[9b]_A_001 in your input and output samples and you will see), as shown below.
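For instance, assuming the file names from the question (my_file.txt and output.txt):

# Compare the relative order of these two rows in the input and in the output
grep '4ch[9b]_A_001' my_file.txt
grep '4ch[9b]_A_001' output.txt

In the input sample 4ch9_A_001 precedes 4chb_A_001, while in the desired output 4chb_A_001 comes first.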
