I'm working on wrangling some data into a matrix format for a PCA analysis. I have attempted a more straightforward version with R, but pivot_wider() fails due to the massive size of this dataset (~4.5M rows).
Anyway, I do have ~450 files organized as such
HG00097 sample, haplotype 1 chromosome 1
HG00097#1_chr1 D1000 15 HG00097#1_chr1 D1001 196 HG00097#1_chr1 D1002 48 HG00097#1_chr1 D1003 55 HG00097#1_chr1 D1012 38 HG00097#1_chr1 D1018 2 HG00099 sample, haplotype 2 chromosome 2
HG00099#2_chr2 D1000 6 HG00099#2_chr2 D1001 36 HG00099#2_chr2 D1002 5 HG00099#2_chr2 D1003 22 HG00099#2_chr2 D1012 1 HG00099#2_chr2 D1013 1 where the first column is a unique identifier for [sample]#[hap (either 1 or 2)]_[chr (from 1 to 22 followed by either X or Y)], the second is a value, and the third how many times that value appears in a specific chromosome.
What I need to do is to organize all those files in one matrix with all individuals stacked in the first row as rownames, and those D values as headers for the cells containing the corresponding occurrences. The idea is that a D value has to be considered only once globally and for those where one sample is missing it, that has to be replaced with 0.
Based on the two samples above I should get the following:
| | D1000 | D1001 | D1002 | D1003 | D1012 | D1013 | D1018 | |----------------|-------|-------|-------|-------|-------|-------|-------| | HG00097#1_chr1 | 15 | 196 | 48 | 55 | 38 | 0 | 2 | | HG00099#2_chr2 | 6 | 36 | 5 | 22 | 1 | 1 | 0 | Additional info:
- files for the two haplotypes (#1 and #2) are separated and sorted according to the
Dvalue from smallest to largest within each chromosome, and chromosomes in both files are sorted from 1 to 22 with X or Y at the end
I was considering to use AWK for the task but I'm not even sure from where to start... any help is appreciated. Let me know whether additional info is needed, thanks!
awkis unlikely to fare any better.#1or#2, right? You never have bothHG...#1andHG...#2in the same file? And you don't actually want the|and-in the output, right? What should the field separators be? Tabs perhaps? Oh, and you have nochrMorMTthen?tabas in the inputs. Also, yes no chrM/MT in this case. Although, it might be worth mention not all samples names start with HG but I don't think is necessarily a problem.followed by either X or Y, c) any sorting requirements for the final result and d) the actualR/pivot_wider()error message(s)