
I have a file containing only numbers, one number per line. I want to find out the number of lines in which the number is greater than 100 (or in fact any other threshold). How can I do that?


2 Answers


Let's consider this test file:

$ cat myfile
98
99
100
101
102
103
104
105

Now, let's count the number of lines with a number greater than 100:

$ awk '$1>100{c++} END{print c+0}' myfile
5

How it works

  • $1>100{c++}

    Every time that the number on the line is greater than 100, the variable c is incremented by 1.

  • END{print c+0}

    After we have finished reading the file, the variable c is printed.

    By adding 0 to c, we force awk to treat c as a number. If there were any lines with numbers >100, then c is already a number. If there were not, then c is an uninitialized variable, which prints as an empty string (hat tip: iruvar). Adding zero to it converts that empty string to a 0, giving the correct output.
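The difference is easiest to see on input where nothing matches; a minimal check, assuming seq and a POSIX awk are available:

```shell
#!/bin/sh
# No value here exceeds 100, so c is never set: printing it bare gives an empty line.
seq 1 5 | awk '$1>100{c++} END{print c}'
# Adding 0 coerces the uninitialized variable to the number 0.
seq 1 5 | awk '$1>100{c++} END{print c+0}'
# The threshold itself can be passed in with -v, covering the "anything else"
# part of the question.
seq 98 105 | awk -v t=103 '$1>t{c++} END{print c+0}'
```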

  • I would change the print c to print 0+c or even print +c so a sane value of 0 is printed when no line exists with a number greater than 100. Commented Sep 26, 2016 at 4:05
  • @iruvar Good point! Thanks. Answer updated with +0 to force conversion to a number. Commented Sep 26, 2016 at 5:32

Similar solution with perl

$ seq 98 105 | perl -ne '$c++ if $_ > 100; END{print $c+0 ."\n"}'
5


Speed comparison: numbers reported for 3 consecutive runs

Random file:

$ perl -le 'print int(rand(200)) foreach (0..10000000)' > rand_numbers.txt
$ perl -le 'print int(rand(100200)) foreach (0..10000000)' >> rand_numbers.txt
$ shuf rand_numbers.txt -o rand_numbers.txt
$ tail -5 rand_numbers.txt
114
100
66125
84281
144
$ wc rand_numbers.txt
20000002 20000002 93413515 rand_numbers.txt
$ du -h rand_numbers.txt
90M rand_numbers.txt

With awk

$ time awk '$1>100{c++} END{print c+0}' rand_numbers.txt
14940305

real    0m7.754s
real    0m8.150s
real    0m7.439s

With perl

$ time perl -ne '$c++ if $_ > 100; END{print $c+0 ."\n"}' rand_numbers.txt
14940305

real    0m4.145s
real    0m4.146s
real    0m4.196s

And just for fun with grep (Updated: faster than even Perl with LC_ALL=C)

$ time grep -xcE '10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}' rand_numbers.txt
14940305

real    0m10.622s

$ time LC_ALL=C grep -xcE '10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}' rand_numbers.txt
14940305

real    0m0.886s
real    0m0.889s
real    0m0.892s
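The alternation enumerates every decimal form of an integer greater than 100, and -x requires the whole line to match. A quick boundary check (assuming a grep with -E support):

```shell
#!/bin/sh
# 10[1-9]         -> 101..109
# 1[1-9][0-9]     -> 110..199
# [2-9][0-9]{2,}  -> 200..999 and longer numbers starting with 2-9
# 1[0-9]{3,}      -> 1000 and up starting with 1
# 99 and 100 must not match; 101 and 1000 must, so the count should be 2.
printf '99\n100\n101\n1000\n' | grep -xcE '10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}'
```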

sed is no fun:

$ time sed -nE '/^10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}$/p' rand_numbers.txt | wc -l
14940305

real    0m11.929s

$ time LC_ALL=C sed -nE '/^10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}$/p' rand_numbers.txt | wc -l
14940305

real    0m6.238s
  • To be fair, compare apples to apples: compare grep w/o -c piped through wc -l to the sed solution, but I expect sed would still be slower. Commented Dec 12, 2017 at 4:48
  • Yeah, I had included sed only because it was tagged by the OP. sed isn't the tool to use for arithmetic, and I was actually surprised when I checked grep with LC_ALL=C today, which prompted the edit. Commented Dec 12, 2017 at 5:05
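The apples-to-apples variant suggested above drops grep's -c and counts via the same pipe the sed solution uses; a sketch on the small sample from earlier:

```shell
#!/bin/sh
# Same regex and -x whole-line matching as before, but counting with wc -l
# so grep pays the same pipe cost as the sed | wc -l pipeline.
printf '99\n100\n101\n1000\n' | grep -xE '10[1-9]|1[1-9][0-9]|[2-9][0-9]{2,}|1[0-9]{3,}' | wc -l
```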
