Simple statistics from the command line interface (CLI), fast.
This is a lightweight, fast tool for calculating basic descriptive statistics from the command line. Inspired by https://github.com/nferraz/st, this project differs in that it is written in C++, allowing for faster computation of statistics given larger non-trivial data sets.
Additions include the choice of biased vs unbiased estimators and the option to use the compensated variant algorithm.
Given a file of 1,000,000 ascending numbers, a simple test on a 2.5GHz dual-core MacBook using Bash time showed sta takes less than a second to complete, compared to 14 seconds using st.
Run ./autogen.sh, ./configure, make, and make install.
sta [options] < file Imagine you have this sample file:
$ cat numbers.txt 1 2 3 4 5 6 7 8 9 10 Running sta is simple:
$ sta < numbers.txt N max min sum mean sd 100 10 1 55 5.5 2.87228 To extract individual bits of information:
$ sta --sum --sd --var < numbers.txt sum sd var 55 2.87228 8.25 sta, by default, assumes you have a population of scores, and thus normalises with N. If in fact you have a sample of scores, and wish to know the expected population standard deviation/variance, i.e. normalise with N-1, then just add the --sample flag. See Standard deviation estimation, and Population variance and sample variance:
$ sta --sum --sd --var --sample < numbers.txt sum sd var 55 3.02765 9.16667 Worried about precision? You can calculate variance instead using the compensated variant algorithm:
$ sta --var --sample --compensated < numbers.txt Want to compute quartiles? Run:
$ sta --q < numbers.txt N min Q1 median Q3 max 100 1 26 50.5 76 100 How about percentiles? Run:
$ sta --p 50,61 < numbers.txt 50th 61th 51 62 Don't want to see the column names? Run:
$ sta --q --brief < numbers.txt 100 1 100 5050 50.5 29.0115 To transpose the output, run:
$ sta --q --transpose < numbers.txt N 100 min 1 Q1 26 median 50.5 Q3 76 max 100 To supply your own delimiter, run:
$ sta --delimiter $'\t\t' --sd --sum < numbers.txt sum sd 55 2.87228 or
$ sta --delimiter XXX --sd --sum < numbers.txt sumXXXsd 55XXX2.87228 sta works with long doubles, and can process numbers in the following formats:
4.7858757E-39 4.7858757e-39 4.7858757 To change the output notation to fixed, supply the --fixed flag.
$ sta --fixed < numbers.txt N min max sum mean sd sderr 10.000000 1.000000 10.000000 55.000000 5.500000 2.872281 0.908295 --brief --compensated --fixed --mean --median --min --max --percentiles --q --q1 --q3 --sample --sd --sderr --sum --transpose The example directory contains 2 scripts to create some example files:
$./examples/create_example_asc.pl n > some_file_with_n_numbers_asc and
$./examples/create_example_rand.pl n > some_file_with_n_numbers_rand To see how long st or sta takes to output the various statistics, call:
$./examples/time_sta.sh examples/large_file_1m and
$./examples/time_st.sh examples/large_file_1m --Add online variant --Add confidence intervals for mean, var, and sd. This should allow for a user supplied interval. I've not written C++ in a long time, so please do send comments, suggestions and bug reports to:
https://github.com/simonccarter/sta/issues
Or fork the code on github:
https://github.com/simonccarter/sta
I've recently integrated a basic testing platform using CxxTest. You'll need to download CxxTest and set the required environment variables to run the tests.
To build the tests, run the buildTester.sh file in the test directory. Then run $ ./tester to run the tests.
Tests are not as extensive as they could be, and contributions are welcome here as well as the main code.