file1:

HOGBRM443983 -2522.00 19800826
HOGBRM445985 -2389.00 19801101
HOUSAM1891409 -1153.00 19811228
HOUSAM2004289 -650.00 19860101
HOUSAM2005991 -843.00 19860109
HOCANM388722 -1546.00 19860116
HOUSAM2007297 -1882.00 19860125
HOUSAM2007389 -1074.00 19860128
HOITAM801038516 -691.00 19860128

Columns 2 and 3 contain a value and the birth date (year, month, day) for each id in column 1, respectively. I want to check how many ids exist within each birth year and what the average value (from the second column) is for each year. For example, in file1 there are 2, 1 and 6 ids in the years 1980, 1981 and 1986 respectively, so the output should be:

output:

1980 2 -2455.5
1981 1 -1153.00
1986 6 -1114.33

in which the first column shows the year of birth, the second column shows the number of ids within each year, and the third column shows the average value of the ids in that year.

Considering that the real data is huge, any suggestions would be appreciated.

4 Answers

With GNU datamash:

cut -c1-35 infile | datamash -W -g 3 count 3 mean 2 

Note that you need to process your data first so as to remove the month and day from the birth date (I used cut since it was the obvious choice with your input sample, but any tool will do):

HOGBRM443983 -2522.00 1980
HOGBRM445985 -2389.00 1980
HOUSAM1891409 -1153.00 1981
HOUSAM2004289 -650.00 1986
......

and only then pipe it to datamash.
This also assumes the 3rd column is sorted by year (if it's not sorted, use datamash -s -W -g ...).
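
If the fields are not fixed-width, cutting on character positions will not line up; a field-based preprocessing step is safer. A minimal sketch using awk instead of cut (assuming the input file is named infile, as above):

# Truncate the date field to the 4-digit year, then group with datamash;
# -s sorts the input first in case it is not already grouped by year.
awk '{ print $1, $2, substr($3, 1, 4) }' infile |
    datamash -s -W -g 3 count 3 mean 2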

The Awk answer:

awk '{y=substr($3,1,4); c[y]++; s[y]+=$2} END {for (y in c) {print y, c[y], (s[y]/c[y])}}' file.txt 
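
The same logic written out with comments for readability (this is just the one-liner above, reformatted):

awk '
  {
    y = substr($3, 1, 4)   # first 4 characters of the date: the year
    c[y]++                 # count of ids seen for this year
    s[y] += $2             # running sum of the values for this year
  }
  END {
    for (y in c)
      print y, c[y], s[y] / c[y]   # year, count, average
  }
' file.txt

Note that for (y in c) visits the years in no particular order; pipe the output through sort -n if you need them sorted.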

Consider using a real database.

Using a Postgres sandbox set up in a Vagrant VM, I did this using the following steps:

CREATE TABLE MyData (id text, val float, bday date);

INSERT INTO MyData VALUES
  ('HOGBRM443983',    -2522.00, '1980-08-26'),
  ('HOGBRM445985',    -2389.00, '1980-11-01'),
  ('HOUSAM1891409',   -1153.00, '1981-12-28'),
  ('HOUSAM2004289',    -650.00, '1986-01-01'),
  ('HOUSAM2005991',    -843.00, '1986-01-09'),
  ('HOCANM388722',    -1546.00, '1986-01-16'),
  ('HOUSAM2007297',   -1882.00, '1986-01-25'),
  ('HOUSAM2007389',   -1074.00, '1986-01-28'),
  ('HOITAM801038516',  -691.00, '1986-01-28');

SELECT extract(year FROM bday) AS yr,
       count(id) AS count,
       avg(val) AS average
FROM mydata
GROUP BY yr;

Output is:

  yr  | count |      average
------+-------+-------------------
 1981 |     1 |             -1153
 1980 |     2 |           -2455.5
 1986 |     6 | -1114.33333333333
(3 rows)

You could probably handle this with text processing, but you mention that the data is HUGE, and an actual database is designed for this sort of computation. (And the blog post I linked to has all the steps to get a Postgres sandbox set up.)
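
For loading a really large file, COPY is much faster than row-by-row INSERTs. A minimal sketch, assuming a whitespace-separated input file named file1 and the table defined above (connection options omitted; Postgres accepts YYYYMMDD date literals directly):

# Squeeze runs of spaces into single tabs so the data parses as
# tab-separated text, then bulk-load it through psql.
tr -s ' ' '\t' < file1 | psql -c "COPY mydata (id, val, bday) FROM STDIN"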

Miller was created to solve problems like these:

$ cat hogbrm.txt | \
    mlr --nidx --repifs put '$3=sub(string($3),"(....).*", "\1")' \
    then stats1 -a count,mean -f 2 -g 3
1980 2 -2455.500000
1981 1 -1153.000000
1986 6 -1114.333333

Context:

  • Use --nidx since there is no header, just positionally indexed columns
  • Use --repifs since multiple spaces separate columns
  • Use sub to drop the last four digits of the date (column 3)
  • Use stats1 to compute the count and mean of column 2 grouped by column 3
