Skip to main content
Became Hot Network Question
edited tags
Link
Kusalananda
  • 356.2k
  • 42
  • 737
  • 1.1k
Source Link
Matteo
  • 321
  • 1
  • 8

reformat data for PCA with AWK

I'm working on wrangling some data into a matrix format for a PCA analysis. I have attempted a more straightforward version with R, but pivot_wider() fails due to the massive size of this dataset (~4.5M rows).

Anyway, I do have ~450 files organized as such

HG00097 sample, haplotype 1 chromosome 1

HG00097#1_chr1 D1000 15 HG00097#1_chr1 D1001 196 HG00097#1_chr1 D1002 48 HG00097#1_chr1 D1003 55 HG00097#1_chr1 D1012 38 HG00097#1_chr1 D1018 2 

HG00099 sample, haplotype 2 chromosome 2

HG00099#2_chr2 D1000 6 HG00099#2_chr2 D1001 36 HG00099#2_chr2 D1002 5 HG00099#2_chr2 D1003 22 HG00099#2_chr2 D1012 1 HG00099#2_chr2 D1013 1 

where the first column is a unique identifier for [sample]#[hap (either 1 or 2)]_[chr (from 1 to 22 followed by either X or Y)], the second is a value, and the third how many times that value appears in a specific chromosome.

What I need to do is to organize all those files in one matrix with all individuals stacked in the first row as rownames, and those D values as headers for the cells containing the corresponding occurrences. The idea is that a D value has to be considered only once globally and for those where one sample is missing it, that has to be replaced with 0.

Based on the two samples above I should get the following:

| | D1000 | D1001 | D1002 | D1003 | D1012 | D1013 | D1018 | |----------------|-------|-------|-------|-------|-------|-------|-------| | HG00097#1_chr1 | 15 | 196 | 48 | 55 | 38 | 0 | 2 | | HG00099#2_chr2 | 6 | 36 | 5 | 22 | 1 | 1 | 0 | 

Additional info:

  • files for the two haplotypes (#1 and #2) are separated and sorted according to the D value from smallest to largest within each chromosome, and chromosomes in both files are sorted from 1 to 22 with X or Y at the end

I was considering to use AWK for the task but I'm not even sure from where to start... any help is appreciated. Let me know whether additional info is needed, thanks!