3

I'm working on wrangling some data into matrix format for a PCA. I attempted a straightforward version in R, but pivot_wider() fails due to the massive size of this dataset (~4.5M rows).

Anyway, I have ~450 files organized like this:

HG00097 sample, haplotype 1, chromosome 1:

HG00097#1_chr1  D1000  15
HG00097#1_chr1  D1001  196
HG00097#1_chr1  D1002  48
HG00097#1_chr1  D1003  55
HG00097#1_chr1  D1012  38
HG00097#1_chr1  D1018  2

HG00099 sample, haplotype 2, chromosome 2:

HG00099#2_chr2  D1000  6
HG00099#2_chr2  D1001  36
HG00099#2_chr2  D1002  5
HG00099#2_chr2  D1003  22
HG00099#2_chr2  D1012  1
HG00099#2_chr2  D1013  1

where the first column is a unique identifier of the form [sample]#[hap (either 1 or 2)]_[chr (1 to 22, X, or Y)], the second is a value, and the third is how many times that value appears in that specific chromosome.

What I need to do is combine all those files into one matrix, with the individuals stacked in the first column as row names and the D values as column headers for the cells containing the corresponding occurrences. Each D value has to be considered only once globally, and wherever a sample is missing a given D value, the count has to be replaced with 0.

Based on the two samples above I should get the following:

|                | D1000 | D1001 | D1002 | D1003 | D1012 | D1013 | D1018 |
|----------------|-------|-------|-------|-------|-------|-------|-------|
| HG00097#1_chr1 | 15    | 196   | 48    | 55    | 38    | 0     | 2     |
| HG00099#2_chr2 | 6     | 36    | 5     | 22    | 1     | 1     | 0     |

Additional info:

  • files for the two haplotypes (#1 and #2) are separated and sorted according to the D value from smallest to largest within each chromosome, and chromosomes in both files are sorted from 1 to 22 with X or Y at the end

I was considering using AWK for the task, but I'm not even sure where to start... any help is appreciated. Let me know whether additional info is needed, thanks!

  • If R fails, awk is unlikely to fare any better. Commented yesterday
  • So each file will only have either #1 or #2, right? You never have both HG...#1 and HG...#2 in the same file? And you don't actually want the | and - in the output, right? What should the field separators be? Tabs perhaps? Oh, and you have no chrM or MT then? Commented yesterday
  • @terdon yes to the first question and yes to the second. I forgot to mention that the field separator should be a tab, as in the inputs. Also, yes, no chrM/MT in this case. Although it might be worth mentioning that not all sample names start with HG, I don't think that's necessarily a problem. Commented yesterday
  • Please update the question with a) actual file names for the 2 samples, b) an example of an identifier that demonstrates the "followed by either X or Y" case, c) any sorting requirements for the final result, and d) the actual R/pivot_wider() error message(s). Commented 11 hours ago
  • What do you mean by 4.5M rows with 450 files? Does that mean about 10,000 lines per file, so at least 10,000 columns in the resulting output? Commented 7 hours ago

4 Answers

4

I'd use perl instead:

$ perl -lane '
    $f{$F[0]}->{$F[1]} = $F[2];
    $d{$F[1]} = 0;
    END {
        $, = "\t";
        @d = sort keys %d;
        print "", @d;
        for $f (sort keys %f) {
            print $f, map {$f{$f}->{$_} // 0} @d;
        }
    }' -- file1 file2 | mlr --t2p --barred cat
+----------------+-------+-------+-------+-------+-------+-------+-------+
|                | D1000 | D1001 | D1002 | D1003 | D1012 | D1013 | D1018 |
+----------------+-------+-------+-------+-------+-------+-------+-------+
| HG00097#1_chr1 | 15    | 196   | 48    | 55    | 38    | 0     | 2     |
| HG00099#2_chr2 | 6     | 36    | 5     | 22    | 1     | 1     | 0     |
+----------------+-------+-------+-------+-------+-------+-------+-------+

Or an English version:

perl -MEnglish -lane '
    ($sample, $value, $count) = @F;
    $count{$sample}->{$value} = $count;
    $all_known_values{$value} = undef;
    END {
        $OUTPUT_FIELD_SEPARATOR = "\N{CHARACTER TABULATION}";
        @all_known_values = sort keys %all_known_values;
        print "Sample", @all_known_values;
        foreach $sample (sort keys %count) {
            print $sample, map {$count{$sample}->{$ARG} // 0} @all_known_values;
        }
    }' -- file1 file2

perl generates tab-separated values output; mlr is just there to pretty-print it. Change "\t" to "," and t2p to c2p for CSV instead.
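For example, the first one-liner with that substitution applied (a sketch based on the same code, writing plain CSV to a file instead of pretty-printing; matrix.csv is a placeholder name):

$ perl -lane '
    $f{$F[0]}->{$F[1]} = $F[2];
    $d{$F[1]} = 0;
    END {
        $, = ",";
        @d = sort keys %d;
        print "", @d;
        for $f (sort keys %f) {
            print $f, map {$f{$f}->{$_} // 0} @d;
        }
    }' -- file1 file2 > matrix.csv   # matrix.csv is a placeholder name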

file1 and file2 here are two files with the examples you gave. You'd give the full list instead.

3

Using Miller (mlr) to read the input as whitespace-delimited fields, reshape it by using the values found in the 2nd field as column labels, with the corresponding values taken from the 3rd field:

$ cat file* | mlr --n2p --barred reshape -s 2,3 then unsparsify --fill-with 0 then label id
+----------------+-------+-------+-------+-------+-------+-------+-------+
| id             | D1000 | D1001 | D1002 | D1003 | D1012 | D1018 | D1013 |
+----------------+-------+-------+-------+-------+-------+-------+-------+
| HG00097#1_chr1 | 15    | 196   | 48    | 55    | 38    | 2     | 0     |
| HG00099#2_chr2 | 6     | 36    | 5     | 22    | 1     | 0     | 1     |
+----------------+-------+-------+-------+-------+-------+-------+-------+

Or, for CSV output:

$ cat file* | mlr --n2c reshape -s 2,3 then unsparsify --fill-with 0 then label id
id,D1000,D1001,D1002,D1003,D1012,D1018,D1013
HG00097#1_chr1,15,196,48,55,38,2,0
HG00099#2_chr2,6,36,5,22,1,0,1

Since CSV (and the pretty-print format used in the first example) requires that every record has the same number of fields, we use the unsparsify operation to add zeros to the fields that don't have values. We also use the label operation to set the first field's label to id (its label would otherwise default to 1).

If you need to sort the fields lexicographically, then do so with sort-within-records before relabeling the first field (doing it after would make the first field be sorted to the end, in this case):

mlr --n2c reshape -s 2,3 then unsparsify --fill-with 0 then sort-within-records then label id 

Testing on 4.5 million lines of input, this runs in well under a minute (about 12s on my system, using about 2 GB of RAM).
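If you want to reproduce a test at that scale, input of roughly that shape is easy to synthesize. A hypothetical Python sketch (all names and value ranges are invented, chosen only to mimic ~450 ids with ~10,000 D values each, i.e. 4.5M lines total):

import random

# Write ~4.5M rows (450 ids x 10,000 D values) to a hypothetical test file.
with open("test_input.txt", "w") as out:
    for i in range(450):
        # Invented id of the form [sample]#[hap]_[chr].
        sample = f"HG{i:05d}#{random.choice([1, 2])}_chr{random.randint(1, 22)}"
        for d in range(10000):
            out.write(f"{sample}\tD{d}\t{random.randint(1, 200)}\n")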

1

As a comment notes, "if R fails, awk is unlikely to fare any better". Your problem is that you need to pivot multiple large matrices.

You mention "the massive size of this dataset (~4.5M rows)".

That does not sound massive to me; it's just 4.5 million integers and their positions in a table. One could argue that's actually not very much data: it could fit on a good computer screen as pixels, and your computer computes dozens of screens per second, so that's not too much for it :)

Also, applying regexes to 4.5 million lines of text does not take much time on a modern computer. So, I think you'll be fine!

It does become a lot of data, however, if you're not handling it compactly.

I'd do this. As a preliminary step (because 4.5 million rows are not that much, we can afford going through all the files in a preliminary processing pass):

  • Determine the number of columns you need. It seems to me that they are all Dxxxx with xxxx being an integer smaller than 10,000? Either you can directly infer the unique values, or you need to count them.
  • Determine the number of rows you need. That probably means putting every occurring value of your first column in a (reverse indexable) set data structure, and checking the size of that set in the end.

Then,

  • allocate a matrix with that number of rows and columns (and have it be zero-filled, which is usually almost free)
  • for each row in your input, look up the row index in your row set, and derive the column index from the D-value. Put the number from your last column at that position in your table.

This matrix is what you seem to do PCA on. Go wild!


I'd implement that in a few lines of C++ or Python, but I'd bet R also has functionality to read files row-wise, split lines on a delimiter, store the first value in a set-like data structure, and so on.
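For illustration, a minimal Python sketch of that two-pass approach (an assumption-laden sketch, not the author's code: it presumes numpy is available, whitespace-separated columns, and a hypothetical counts/*.txt glob for the ~450 files):

import glob
import numpy as np

# Hypothetical location of the ~450 input files; adjust the pattern.
files = sorted(glob.glob("counts/*.txt"))

# Pass 1: collect the unique row ids (field 1) and unique D values (field 2).
rows, cols = set(), set()
for path in files:
    with open(path) as fh:
        for line in fh:
            sample, dval, _count = line.split()
            rows.add(sample)
            cols.add(dval)

# Reverse-indexable sets: map each id / D value to its matrix index.
row_index = {r: i for i, r in enumerate(sorted(rows))}
col_index = {c: j for j, c in enumerate(sorted(cols))}

# Pass 2: allocate a zero-filled matrix (missing entries stay 0) and fill it.
mat = np.zeros((len(row_index), len(col_index)), dtype=np.int32)
for path in files:
    with open(path) as fh:
        for line in fh:
            sample, dval, count = line.split()
            mat[row_index[sample], col_index[dval]] = int(count)

# mat is now ready for PCA; row_index and col_index hold the labels.

Using dictionaries as the reverse-indexable sets keeps both lookups O(1), matching the "look up the row index in your row set" step above.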

1

Using any POSIX awk plus sort:

$ cat tst.sh
#!/usr/bin/env bash
awk '
    BEGIN { OFS="\t" }
    NR == FNR {
        haps[++numHaps] = $2
        next
    }
    FNR == 1 {
        prt()
        sample = $1
    }
    { cnts[$2] = $3 }
    END { prt() }

    function prt(   hapNr) {
        if ( doneHdr++ ) {
            printf "%s%s", sample, OFS
            for ( hapNr=1; hapNr<=numHaps; hapNr++ ) {
                printf "%d%s", cnts[haps[hapNr]], (hapNr < numHaps ? OFS : ORS)
            }
        }
        else {
            printf "%s%s", "Sample", OFS
            for ( hapNr=1; hapNr<=numHaps; hapNr++ ) {
                printf "%s%s", haps[hapNr], (hapNr < numHaps ? OFS : ORS)
            }
        }
        delete cnts
    }
' <(sort -u -k2,2 "$@") "$@"

$ ./tst.sh file1 file2
Sample          D1000   D1001   D1002   D1003   D1012   D1013   D1018
HG00097#1_chr1  15      196     48      55      38      0       2
HG00099#2_chr2  6       36      5       22      1       1       0

That reads all the unique D values, plus the D values and counts from just one file at a time, into memory in awk (as opposed to reading all of the values from all of the files into memory at once), and uses sort just to sort the D values uniquely, so it should have no problem with the amount of input you describe.

You can pipe the above tab-separated output to whatever you like to format it differently, e.g.:

$ ./tst.sh file1 file2 | column -t -s$'\t' -o' | ' -R0
Sample         | D1000 | D1001 | D1002 | D1003 | D1012 | D1013 | D1018
HG00097#1_chr1 |    15 |   196 |    48 |    55 |    38 |     0 |     2
HG00099#2_chr2 |     6 |    36 |     5 |    22 |     1 |     1 |     0

or you could add some formatting code to the awk script if you prefer.
