I'm working on wrangling some data into a matrix format for a PCA analysis. I have attempted a more straightforward version with R, but pivot_wider() fails due to the massive size of this dataset (~4.5M rows).
Anyway, I do have ~450 files organized as such
HG00097 sample, haplotype 1 chromosome 1
HG00097#1_chr1 D1000 15 HG00097#1_chr1 D1001 196 HG00097#1_chr1 D1002 48 HG00097#1_chr1 D1003 55 HG00097#1_chr1 D1012 38 HG00097#1_chr1 D1018 2 HG00099 sample, haplotype 2 chromosome 2
HG00099#2_chr2 D1000 6 HG00099#2_chr2 D1001 36 HG00099#2_chr2 D1002 5 HG00099#2_chr2 D1003 22 HG00099#2_chr2 D1012 1 HG00099#2_chr2 D1013 1 where the first column is a unique identifier for [sample]#[hap (either 1 or 2)]_[chr (from 1 to 22 followed by either X or Y)], the second is a value, and the third how many times that value appears in a specific chromosome.
What I need to do is to organize all those files in one matrix with all individuals stacked in the first row as rownames, and those D values as headers for the cells containing the corresponding occurrences. The idea is that a D value has to be considered only once globally and for those where one sample is missing it, that has to be replaced with 0.
Based on the two samples above I should get the following:
| | D1000 | D1001 | D1002 | D1003 | D1012 | D1013 | D1018 | |----------------|-------|-------|-------|-------|-------|-------|-------| | HG00097#1_chr1 | 15 | 196 | 48 | 55 | 38 | 0 | 2 | | HG00099#2_chr2 | 6 | 36 | 5 | 22 | 1 | 1 | 0 | Additional info:
- files for the two haplotypes (#1 and #2) are separated and sorted according to the
Dvalue from smallest to largest within each chromosome, and chromosomes in both files are sorted from 1 to 22 with X or Y at the end
I was considering to use AWK for the task but I'm not even sure from where to start... any help is appreciated. Let me know whether additional info is needed, thanks!