I have a file with 1000 text lines. I want to sort on the 4th column within each 20-line interval and print the output to another file. Can anybody help me sort them with awk or sed?

Here is an example of the input data structure:

1 1.1350 1092.42 0.0000
2 1.4645 846.58 0.0008
3 1.4760 840.01 0.0000
4 1.6586 747.52 0.0006
5 1.6651 744.60 0.0000
6 1.7750 698.51 0.0043
7 1.9216 645.20 0.0062
8 2.1708 571.14 0.0000
9 2.1839 567.71 0.0023
10 2.2582 549.04 0.0000
11 2.2878 541.93 1.1090
12 2.3653 524.17 0.0000
13 2.3712 522.88 0.0852
14 2.3928 518.15 0.0442
15 2.5468 486.82 0.0000
16 2.6504 467.79 0.0000
17 2.6909 460.75 0.0001
18 2.7270 454.65 0.0000
19 2.7367 453.04 0.0004
20 2.7996 442.87 0.0000
1 1.4962 828.64 0.0034
2 1.6848 735.91 0.0001
3 1.6974 730.45 0.0005
4 1.7378 713.47 0.0002
5 1.7385 713.18 0.0007
6 1.8086 685.51 0.0060
7 2.0433 606.78 0.0102
8 2.0607 601.65 0.0032
9 2.0970 591.24 0.0045
10 2.1033 589.48 0.0184
11 2.2396 553.61 0.0203
12 2.2850 542.61 1.1579
13 2.3262 532.99 0.0022
14 2.6288 471.64 0.0039
15 2.6464 468.51 0.0051
16 2.7435 451.92 0.0001
17 2.7492 450.98 0.0002
18 2.8945 428.34 0.0010
19 2.9344 422.52 0.0001
20 2.9447 421.04 0.0007

expected output:

11 2.2878 541.93 1.1090
12 2.2850 542.61 1.1579

Each 20-line interval contains only one highest (unique) value in the 4th column.

3 Answers

Via awk:

awk 'NR%20==1 {max=$4; line=$0} $4>max {max=$4; line=$0} NR%20==0 {print line}' file.txt
  • Excellent one-process, one-pass solution. No need to sort the other 19 lines in each group: a case of premature non-optimisation. Commented Jun 8, 2022 at 7:52
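
If the group size or file names differ, the same idea can be parameterized; a minimal sketch, assuming hypothetical file names input.txt and output.txt and passing the interval as an awk variable:

awk -v n=20 'NR%n==1 {max=$4; line=$0} $4>max {max=$4; line=$0} NR%n==0 {print line}' input.txt > output.txt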

With GNU sort and GNU split, you can do

split -l 20 file.txt --filter "sort -nk 4|tail -n 1" 

The file gets split into chunks of 20 lines; the --filter option then runs each chunk through the given commands, so each chunk is sorted numerically on the 4th field and only the last line (the highest value) is extracted by tail.
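
Since the --filter commands inherit split's standard output, the selected lines can be redirected to another file; a minimal sketch, assuming a hypothetical output file name result.txt:

split -l 20 file.txt --filter "sort -nk 4 | tail -n 1" > result.txt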

Using the DSU (Decorate/Sort/Undecorate) idiom with any awk+sort+cut:

$ awk -v OFS='\t' '(NR==1) || ($1<p){b++} {p=$1; print b, $0}' file | sort -k5,5rn | awk '!seen[$1]++' | sort -k1,1n | cut -f2-
11 2.2878 541.93 1.1090
12 2.2850 542.61 1.1579

See https://stackoverflow.com/questions/71691113/how-to-sort-data-based-on-the-value-of-a-column-for-part-multiple-lines-of-a-f/71694367#71694367 for more info on DSU.

As mentioned in the comments by @StéphaneChazelas, if you have GNU sort then you could abbreviate the above a little to:

awk -v OFS='\t' '(NR==1) || ($1<p){b++} {p=$1; print b, $0}' file | sort -k5,5rn | sort -suk1,1n | cut -f2- 
  • With GNU sort at least, you can replace the awk '!seen[$1]++' | sort -k1,1n with sort -uk1,1n Commented Jun 7, 2022 at 12:37
  • @StéphaneChazelas hmm, I think I'd need to add -s (for "stable sort") to make it sort -suk1,1n, otherwise the unique 1st field selected may not be the first one from the input, since the order of output given duplicate keys isn't guaranteed by default, so you could get output of a line that doesn't contain the highest value. If I do need -s as I think, then it would indeed be specific to GNU sort; otherwise it'd work in any POSIX sort. Commented Jun 7, 2022 at 12:46
  • With GNU sort, where the sorting algorithm is stable, -s is to disable the last-resort full line comparison, but that's not relevant with -u (but doesn't harm). With other sort implementations YMMV. It definitely doesn't work with busybox sort but then again busybox sort is otherwise quite buggy. Commented Jun 7, 2022 at 13:01
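
For comparison, when the input is known to come in fixed 20-line blocks (as in the question), the decoration step can simply number blocks by line count instead of watching the 1st field reset; a minimal sketch of the same DSU idea, assuming GNU sort for the -s/-u step discussed above:

awk -v OFS='\t' '{print int((NR-1)/20), $0}' file | sort -k5,5rn | sort -suk1,1n | cut -f2-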
