
I have the following dataset called "snp_sol", which has 481974 rows:

    trait  effect  snp     chr  pos      snp_effect      weight     variance_explained  var_a_hat
    1      2       1       1    54       0.2030156E-02   1.251482   0                   0
    1      2       2       1    689      -0.3726744E-03  0.9660012  0                   0
    1      2       3       1    1234     0.4801369E-03   0.9823542  0                   0
    1      2       4       1    1280     -0.1104844E-03  0.9272357  0                   0
    1      2       5       1    2610     -0.1296295E-02  1.115933   0                   0
    ...    ...     ...     ...  ...      ...             ...        ...                 ...
    1      2       481971  26   4897157  -0.7846317E-04  0.9226092  0                   0
    1      2       481972  26   4898314  -0.3934468E-03  0.9691408  0                   0
    1      2       481973  26   4898376  -0.7204678E-03  1.019935   0                   0
    1      2       481974  26   4898606  -0.1522481E-03  0.9333048  0                   0

I want to get the mean of every 50 values (windows) in the seventh column (weight), and this mean should replace the values it was computed from, as below:

    trait  effect  snp     chr  pos      snp_effect      weight                         variance_explained  var_a_hat
    1      2       1       1    54       0.2030156E-02   mean of first 50 rows          0                   0
    ...    ...     ...     ...  ...      ...             ...                            ...                 ...
    1      2       50      1    4234     0.5801369E-03   mean of first 50 rows          0                   0
    1      2       51      1    5080     -0.5048544E-03  mean of second set of 50 rows  0                   0
    ...    ...     ...     ...  ...      ...             ...                            ...                 ...
    1      2       100     1    12050    -0.4854433E-03  mean of second set of 50 rows  0                   0
    1      2       101     1    14080    -0.3554433E-03  mean of third set of 50 rows   0                   0
    ...    ...     ...     ...  ...      ...             ...                            ...                 ...
    1      2       150     1    14080    -0.7894433E-03  mean of third set of 50 rows   0                   0
    and so on
    1      2       481974  26   4898606  -0.1522481E-03  mean of last rows              0                   0

Note that the windows must not overlap, and the last window may have fewer than 50 rows.

I was trying this code:

    NR=$(wc -l "snp_sol" | awk '{print $1}')  # Count the number of rows
    window=$((NR/50))                          # Define the number of windows
    int=${window%.*}                           # Convert to integer
    it=$((2*int))                              # Double the number of windows
    for i in $(seq 0 50 $it)                   # for statement with a seq to count the windows
    do
        vi=$i                                  # Variable to define the beginning of the window
        vf=$((vi+50))                          # Variable to define the end of the window
        awk -v vi="$vi" -v vf="$vf" '{
            if(NR > vi && NR <= vf)            # take each window
                print
        }' snp_sol > b.txt                     # new temporary file to receive the window
        m=$(awk '{sum+=$7} END {m=sum/NR; print m}' b.txt)        # Calculate the mean
        awk -v mean="$m" '{print $1=$3,$2=mean}' b.txt > $i.temp  # save a temporary file with the mean in second column
        rm b.txt                               # Remove the file created to calculate the mean
    done
    cat *.temp > b.temp                        # join all temporary files in sequence
    paste snp_sol b.temp > c.temp
    awk '{print $1,$2,$3,$4,$5,$6,$7=$11,$8,$9=$10}' c.temp > snp_sol
    rm *.temp

However, this is not working. There must be another way to do it, but I don't know how to do it.

The solution should preferably use a shell script.

Please, can you help me?

Thanks in advance.

2 Answers

1

Using GNU datamash, split (GNU coreutils) and awk:

    #!/bin/bash

    # remove header line and split `input_file` into n files `split00000`, `split00001`...
    # with max. 4 lines each (use `-l50` for your data file)
    split -d -a5 -l4 <(tail -n+2 input_file) split

    {
      head -n1 input_file # add header
      for fsplit in split*; do
        mean=$(datamash -W mean 7 < "$fsplit") # calculate mean value
        awk -v mean="$mean" '{ print $1,$2,$3,$4,$5,$6,mean,$8,$9 }' "$fsplit"
      done
    } | column -t > output_file # format as table and write result

    rm split* # cleanup

In this script I used your data (dotted lines removed) as input and a window of only 4 rows for the mean.
Replace -l4 with -l50 for your data file in the script. This is pretty much the same as you did, I just let split and datamash do all the work.
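If datamash isn't installed, or you'd rather not create split* files in the current directory, roughly the same idea can be sketched with plain split and awk. This is only a sketch under assumptions: the function name split_mean and the part. prefix are made up for illustration, awk stands in for datamash to compute the mean, and the %.10g output format is an arbitrary choice.

```shell
#!/bin/sh
# split_mean FILE [LINES]: hypothetical helper, not part of the script above.
# Same split-per-window idea, but the pieces live in a scratch directory
# and awk (instead of datamash) computes the mean of column 7.
split_mean() {
    file=$1
    lines=${2:-50}                        # window size, 50 unless given
    tmp=$(mktemp -d) || return 1          # scratch directory for the pieces

    head -n 1 "$file"                     # header line, unchanged
    tail -n +2 "$file" | split -l "$lines" -a 5 - "$tmp/part."

    for piece in "$tmp"/part.*; do
        [ -e "$piece" ] || continue       # no data rows: glob matched nothing
        # Read each piece twice: the first pass sums column 7,
        # the second pass prints every row with the mean in column 7.
        awk 'NR == FNR { sum += $7; n++; next }
             { $7 = sprintf("%.10g", sum / n); print }' "$piece" "$piece"
    done

    rm -rf "$tmp"                         # cleanup
}
```

Something like split_mean input_file 4 | column -t > output_file should then behave much like the script above on the sample data.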

Input file:

    $ cat input_file
    trait  effect  snp     chr  pos      snp_effect      weight     variance_explained  var_a_hat
    1      2       1       1    54       0.2030156E-02   1.251482   0                   0
    1      2       2       1    689      -0.3726744E-03  0.9660012  0                   0
    1      2       3       1    1234     0.4801369E-03   0.9823542  0                   0
    1      2       4       1    1280     -0.1104844E-03  0.9272357  0                   0
    1      2       5       1    2610     -0.1296295E-02  1.115933   0                   0
    1      2       481971  26   4897157  -0.7846317E-04  0.9226092  0                   0
    1      2       481972  26   4898314  -0.3934468E-03  0.9691408  0                   0
    1      2       481973  26   4898376  -0.7204678E-03  1.019935   0                   0
    1      2       481974  26   4898606  -0.1522481E-03  0.9333048  0                   0

Result:

    $ cat output_file
    trait  effect  snp     chr  pos      snp_effect      weight       variance_explained  var_a_hat
    1      2       1       1    54       0.2030156E-02   1.031768275  0                   0
    1      2       2       1    689      -0.3726744E-03  1.031768275  0                   0
    1      2       3       1    1234     0.4801369E-03   1.031768275  0                   0
    1      2       4       1    1280     -0.1104844E-03  1.031768275  0                   0
    1      2       5       1    2610     -0.1296295E-02  1.0069045    0                   0
    1      2       481971  26   4897157  -0.7846317E-04  1.0069045    0                   0
    1      2       481972  26   4898314  -0.3934468E-03  1.0069045    0                   0
    1      2       481973  26   4898376  -0.7204678E-03  1.0069045    0                   0
    1      2       481974  26   4898606  -0.1522481E-03  0.9333048    0                   0
0
    awk -v mod=50 '
      BEGIN { if (!mod) { mod = 50 } }

      NR == 1 { print; next }       # header line

      { count++; sum += $7 }        # every data line counts towards its block

      (NR - 1) % mod == 0 {         # last line of a full block (NR - 1 skips the header)
        $7 = sum / count
        print
        sum = count = 0
        next
      }

      END {
        if ((NR - 1) % mod != 0) {  # leftover lines in a shorter final block
          $7 = sum / count
          print
        }
      }' snp_sol

This prints the header line unmodified. Then, on every 50th data line, it replaces the value in $7 with the calculated mean of that block and prints the line. It also does the same on the final input line iff it wasn't previously printed.

For every data line, it increments count (a line counter) and adds $7 to sum (which contains the sum of all $7 values in that block of mod input lines). Both are reset after each block is printed.

No temporary files, no repeat runs over the same data, no shell loop forking awk multiple times. Just one simple pass through the input file, with a very simple algorithm.
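The one-liner above prints one line per block. If every input row should be kept, with its weight replaced by the block mean as in the expected output in the question, one possible single-pass variant is to buffer each block until it is complete. This is a sketch, not a drop-in replacement: the function name window_mean is made up here, and the %.10g format for the mean is an arbitrary choice.

```shell
#!/bin/sh
# window_mean FILE [SIZE]: hypothetical helper name, for illustration only.
# Prints the header, then every data row with column 7 replaced by the mean
# of its non-overlapping window of SIZE rows (default 50).
window_mean() {
    awk -v size="${2:-50}" '
        # Print all buffered rows with the window mean in column 7.
        function emit(   i, mean) {
            if (n == 0) return
            mean = sprintf("%.10g", sum / n)
            for (i = 1; i <= n; i++) {
                $0 = row[i]
                $7 = mean
                print
            }
            n = 0
            sum = 0
        }
        NR == 1   { print; next }              # header passes through unchanged
                  { row[++n] = $0; sum += $7 } # buffer the row, add its weight
        n == size { emit() }                   # window is full
        END       { emit() }                   # last, possibly shorter, window
    ' "$1"
}
```

At most one window of rows is held in memory at a time, so this should still be a single pass over the file. Called as window_mean snp_sol 50 > snp_sol.new, it writes the result to a new file rather than overwriting the input.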

NOTE: A variable called mod is used instead of hard-coding 50 as the modulus. This defaults to 50 if it isn't specified on the command line with, e.g., -v mod=n.
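To see how the default works, here is a small sketch that reuses the same BEGIN idiom to report how many full blocks a file has and how many rows are left over for the final block. The function name window_count is made up for this example. With 481974 data rows and mod=50 the arithmetic gives 9639 full blocks and 24 leftover rows (9639 * 50 = 481950).

```shell
#!/bin/sh
# window_count FILE [MOD]: hypothetical helper, for illustration only.
# Prints "<full blocks> <leftover rows>" for the data rows of FILE
# (the first line is treated as a header and not counted).
window_count() {
    awk -v mod="$2" '
        BEGIN  { if (!mod) mod = 50 }   # empty or zero mod falls back to 50
        NR > 1 { rows++ }               # count data rows only
        END    { print int(rows / mod), rows % mod }
    ' "$1"
}
```

Leaving out the second argument passes an empty mod to awk, which the BEGIN block then replaces with 50.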
