Splitting each row of a correlation matrix into individual files

Question

I have a correlation matrix of 22000 genes and for some analysis, I need to split each row of the matrix into a new file. Which means I need to create 22000 individual files.

I don't want to use the split command, because I want each output file to be named gene_name.txt. Example input file:

 IGHD2-15 IGHD3-22 IGHD3-16 IGHD3-10
IGHD2-15 1 0.696084 0.799736 0.818788
IGHD3-22 0.696084 1 0.691419 0.67505
IGHD3-16 0.799736 0.691419 1 0.810656
IGHD3-10 0.818788 0.67505 0.810656 1

Example input is a good first step, but we'll also need an example of the output you'd like to achieve. ;) — n.st
– n.st, Commented Nov 29, 2018 at 23:27
Output file for IGHD2-15: IGHD2-15 1 0.696084 0.799736 0.818788 — Priya
– Priya, Commented Nov 29, 2018 at 23:52
This question is completely on topic and welcome to stay here, but for future reference, you might be interested in our sister site: Bioinformatics. — terdon
– terdon ♦, Commented Nov 29, 2018 at 23:58

terdon · Accepted Answer · 2018-12-01 16:19:41Z

Assuming your gene names are in the first column, all you need is:

awk '{print >> $1".txt"; close($1".txt")}' matrix.txt

That will print each line into a file whose name is the 1st field of that line plus a (completely optional) .txt extension. If you don't want the gene name in the file, use:

awk '{n=$1; $1="";print >> n".txt"; close(n".txt")}' matrix.txt

And, if your first line is a header, use:

awk 'NR>1{print >> $1".txt"; close($1".txt")}' matrix.txt

Finally, in the unlikely case that the first field of a line isn't always a simple gene name (it might be empty, or it might look like a file path), you need to sanitize your input. You can use:

awk 'NR > 1 && ($1 ~ /^[A-Z0-9-]+$/) { print >> $1; close($1) }' matrix.txt
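
As a quick check, the header-skipping variant can be run against the four-gene sample from the question. This is a minimal sketch: the scratch directory created by mktemp and the variable name out are assumptions made for illustration. The output name is built in a variable first because some awk implementations parse an unparenthesized concatenation after >> differently.

```shell
# Work in a scratch directory so the generated files are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample matrix from the question: a header line followed by one row per gene.
cat > matrix.txt <<'EOF'
 IGHD2-15 IGHD3-22 IGHD3-16 IGHD3-10
IGHD2-15 1 0.696084 0.799736 0.818788
IGHD3-22 0.696084 1 0.691419 0.67505
IGHD3-16 0.799736 0.691419 1 0.810656
IGHD3-10 0.818788 0.67505 0.810656 1
EOF

# Skip the header (NR > 1), append each row to <gene>.txt, then close the file.
awk 'NR > 1 { out = $1 ".txt"; print >> out; close(out) }' matrix.txt

ls
cat IGHD2-15.txt   # IGHD2-15 1 0.696084 0.799736 0.818788
```

Because the command appends with >>, running it twice in the same directory would duplicate each row, so it is safest to start from an empty directory.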

Expecting more than 20000 files, you might want to close () each file after printing... — RudiC
– RudiC, Commented Nov 30, 2018 at 18:38
@RudiC why? There is no reason to assume the input file will be sorted, or that the genes will all be unique, that's why I'm using >>. What benefit would there be if I added close(n".txt") so the file would be closed each time? Actually, isn't that what awk will do anyway? How would an explicit close help? — terdon
– terdon ♦, Commented Nov 30, 2018 at 18:43
awk will crash if you exceed the OPEN_MAX system configuration value, or its internal maximum (if different). Like: awk: cannot open "/tmp/1022" for output (Too many open files) and getconf OPEN_MAX 1024 — RudiC
– RudiC, Commented Nov 30, 2018 at 18:53
@rudic I see, yes that makes a lot of sense. Answer edited, thanks! — terdon
– terdon ♦, Commented Dec 1, 2018 at 0:59
@mosvy yes, I know, that's why the last line there is "if your first line is a header". — terdon
– terdon ♦, Commented Dec 1, 2018 at 1:02
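
The limit RudiC describes can be inspected with getconf OPEN_MAX or ulimit -n. The sketch below lowers the limit inside a subshell (64 is an arbitrary value chosen for illustration) and splits 200 synthetic rows. Because each file is closed right after it is written, the run stays under the limit. Without the close() call, some awk implementations would stop with "Too many open files" here, while GNU awk works around the limit internally, so the explicit close is the portable choice.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Show the current per-process limit on open file descriptors.
ulimit -n

# Build 200 synthetic rows named gene1 ... gene200 (made-up data).
awk 'BEGIN { for (i = 1; i <= 200; i++) print "gene" i, i / 1000 }' > matrix.txt

# Lower the limit in a subshell only, then split with an explicit close().
(
  ulimit -n 64
  awk '{ out = $1 ".txt"; print >> out; close(out) }' matrix.txt
)

# Count the files that were created; all 200 rows should have one.
ls gene*.txt | wc -l
```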

Wayne · Accepted Answer · 2018-11-30 00:05:43Z

Since you didn't give an example of what you wanted each file to contain, or what the files should be named, I'm guessing.

This one will take the file "DATA" from your current directory, create a new file (in the same directory) named after the first column of each row, then fill that file with the data from the rest of the columns.

Meaning

IGHD2-15 1 0.696084 0.799736 0.818788

Creates a file called IGHD2-15 and puts this in it

1 0.696084 0.799736 0.818788

Script:

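#!/bin/bash
while read -r line; do
    # First column becomes the file name
    newFileName="$(echo "$line" | awk '{print $1}')"
    # Remaining columns become the file contents
    newFileData="$(echo "$line" | awk '{$1 = ""; print $0}')"
    # $newFileData is left unquoted so the leading space is dropped
    echo $newFileData > "$newFileName"
done < DATA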
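
If a shell loop is preferred for readability, read can split the gene name from the rest of the row itself, which avoids starting two awk processes per line. This is only a sketch, assuming whitespace-separated rows with no header and the same input name DATA; the variable names name and rest are made up for illustration, and a single awk pass is still likely to be faster for 22000 rows.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two sample rows from the question, without a header line.
cat > DATA <<'EOF'
IGHD2-15 1 0.696084 0.799736 0.818788
IGHD3-22 0.696084 1 0.691419 0.67505
EOF

# read puts the first field in "name" and the remainder of the line in "rest".
while read -r name rest; do
    printf '%s\n' "$rest" > "$name"
done < DATA

cat IGHD2-15   # 1 0.696084 0.799736 0.818788
```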

This is going to be very slow for a file this size. Also, as a general rule, using the shell for this sort of thing is a bad idea. — terdon
– terdon ♦, Commented Nov 29, 2018 at 23:56
Yeah, I like your answer better. I didn't even know you could do that. I tried to make mine easy to understand and to change, like the file name and what data to include in the file — Wayne
– Wayne, Commented Nov 30, 2018 at 0:00
Oh, don't worry, I used to do things like this too before I started hanging out here and the local gurus beat it out of me :) — terdon
– terdon ♦, Commented Nov 30, 2018 at 0:01

Praveen Kumar BS · Accepted Answer · 2018-11-30 09:59:37Z

I tried the method below and checked that it works fine too.

Here each individual line is copied to a new file. The file name will be the first column of each line.

cat data_file.txt
IGHD2-15 1 0.696084 0.799736 0.818788
IGHD3-22 0.696084 1 0.691419 0.67505
IGHD3-16 0.799736 0.691419 1 0.810656
IGHD3-10 0.818788 0.67505 0.810656 1
[root@praveen_linux_example dev]# j=`cat data_file.txt| wc -l`
[root@praveen_linux_example dev]# for ((z=1;z<=$j;z++)); do filename=`awk -v line="$z" 'NR==line{print $1}' data_file.txt`; sed -n ''$z'p' data_file.txt >$filename.txt;done
[root@praveen_linux_example dev]#
