file1:

HOGBRM443983 -2522.00 19800826
HOGBRM445985 -2389.00 19801101
HOUSAM1891409 -1153.00 19811228
HOUSAM2004289 -650.00 19860101
HOUSAM2005991 -843.00 19860109
HOCANM388722 -1546.00 19860116
HOUSAM2007297 -1882.00 19860125
HOUSAM2007389 -1074.00 19860128
HOITAM801038516 -691.00 19860128

Columns 2 and 3 contain a value and the birth date (year, month, day) for each id in column 1, respectively. I want to check how many ids exist within each birth year and what the average value (from the second column) is for each year. For example, in file1 there are 2, 1 and 6 ids in the years 1980, 1981 and 1986 respectively, so the output should be:

output:

1980 2 -2455.5
1981 1 -1153.00
1986 6 -1114.33

in which the first column shows the year of birth, the second column shows the number of ids within each year, and the third column shows the average value of the ids in that year.

Considering that the real data is huge, any suggestions would be appreciated.

4 Answers

With GNU datamash:

cut -c1-35 infile | datamash -W -g 3 count 3 mean 2 

Note that you need to process your data first so as to remove the month and day from the birth date (I used cut since it was the obvious choice with your input sample, but any tool will do):

HOGBRM443983 -2522.00 1980
HOGBRM445985 -2389.00 1980
HOUSAM1891409 -1153.00 1981
HOUSAM2004289 -650.00 1986
......

and only then pipe it to datamash.
This also assumes the 3rd column is sorted by year (if it's not sorted, use datamash -s -W -g ...).
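
If the fields are not fixed-width, cutting on character positions will not line up; a field-based preprocessing step is safer. A minimal sketch using awk instead of cut (assuming the input file is named infile, as above):

# Truncate the date field to the 4-digit year, then group with datamash;
# -s sorts the input first in case it is not already grouped by year.
awk '{ print $1, $2, substr($3, 1, 4) }' infile |
    datamash -s -W -g 3 count 3 mean 2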

The Awk answer:

awk '{y=substr($3,1,4); c[y]++; s[y]+=$2} END {for (y in c) {print y, c[y], (s[y]/c[y])}}' file.txt 
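
The same logic written out with comments for readability (this is just the one-liner above, reformatted):

awk '
  {
    y = substr($3, 1, 4)   # first 4 characters of the date: the year
    c[y]++                 # count of ids seen for this year
    s[y] += $2             # running sum of the values for this year
  }
  END {
    for (y in c)
      print y, c[y], s[y] / c[y]   # year, count, average
  }
' file.txt

Note that for (y in c) visits the years in no particular order; pipe the output through sort -n if you need them sorted.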

Consider using a real database.

Using a Postgres sandbox set up in a Vagrant VM, I did this using the following steps:

CREATE TABLE MyData (id text, val float, bday date);

INSERT INTO MyData VALUES
  ('HOGBRM443983',    -2522.00, '1980-08-26'),
  ('HOGBRM445985',    -2389.00, '1980-11-01'),
  ('HOUSAM1891409',   -1153.00, '1981-12-28'),
  ('HOUSAM2004289',    -650.00, '1986-01-01'),
  ('HOUSAM2005991',    -843.00, '1986-01-09'),
  ('HOCANM388722',    -1546.00, '1986-01-16'),
  ('HOUSAM2007297',   -1882.00, '1986-01-25'),
  ('HOUSAM2007389',   -1074.00, '1986-01-28'),
  ('HOITAM801038516',  -691.00, '1986-01-28');

SELECT extract(year FROM bday) AS yr,
       count(id) AS count,
       avg(val) AS average
FROM mydata
GROUP BY yr;

Output is:

  yr  | count |      average
------+-------+-------------------
 1981 |     1 |             -1153
 1980 |     2 |           -2455.5
 1986 |     6 | -1114.33333333333
(3 rows)

You could probably handle this with text processing, but you mention that the data is HUGE, and an actual database is designed for this sort of computation. (And the blog post I linked to has all the steps to get a Postgres sandbox set up.)
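
For loading a really large file, COPY is much faster than row-by-row INSERTs. A minimal sketch, assuming a whitespace-separated input file named file1 and the table defined above (connection options omitted; Postgres accepts YYYYMMDD date literals directly):

# Squeeze runs of spaces into single tabs so the data parses as
# tab-separated text, then bulk-load it through psql.
tr -s ' ' '\t' < file1 | psql -c "COPY mydata (id, val, bday) FROM STDIN"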

Miller was created to solve problems like these:

$ cat hogbrm.txt | \
    mlr --nidx --repifs put '$3=sub(string($3),"(....).*", "\1")' \
    then stats1 -a count,mean -f 2 -g 3
1980 2 -2455.500000
1981 1 -1153.000000
1986 6 -1114.333333

Context:

  • Use --nidx since there is no header, just positionally indexed columns
  • Use --repifs since multiple spaces separate columns
  • Use sub to drop the last four digits of the date (column 3)
  • Use stats1 to compute the count and mean of column 2 grouped by column 3
