Linux - split a big file into small files according to specific character and how many it repeats

Question

I have a really big file that looks like this:

>name1 ACGTACGTACGT ACGTACGTACGT ACGTACGTACGT ACGTACGTACGT >name2 ACGTACGTACGT ACGTACGTACGT ACGTACGTACGT >name ACGTACGTACGT ACGTACGTACGT ACGTACGTACGT ACGTACGTACGT ACGTACGTACGT ACGTACGTACGT ACGTACGTACGT ACGTACGTACGT >name4 ACGTACGTACGT ACGTACGTACGT ACGTACGTACGT ACGTACGTACGT ACGTACGTACGT

It is a fasta file. It has about 3183 lines that start with > (3183 names), followed by random number of lines of ACGTs. I want to split it into smaller files of 250 >s followed by their number of lines of ACGTs. And if the last file does not have 250 >s that is fine. I would still like to keep it. So far I tried split, which I don't think is appropriate here since it splits the file into one > in each small file. I also tried awk:

awk -F'>' 'NR==1{f=0;c=1}NR>1{ c++ if($((c%250))==0) { fn="file"c".fasta"; print > fn} }' kmer_subtraction/kmercollection.fasta

I am not sure if this works because I cannot see my file. Could you please help me with this? Thank you!

This question is welcome and 100% on topic here, but based on the subject matter, you might also be interested in our sister site: Bioinformatics. — terdon
– terdon ♦, Commented Dec 12, 2022 at 12:03

Ed Morton · Accepted Answer · 2022-12-12 16:03:35Z

0

Using any awk:

awk ' />/ { if ( (++c % 250) == 1 ) { close(fn); fn="file"(++n)".fasta" } } { print > fn } ' file.fasta

edited Dec 12, 2022 at 16:03

answered Dec 12, 2022 at 15:57

Ed Morton

35.9k6 gold badges25 silver badges60 bronze badges

1

Thank you it works!

msening
– msening

2022-12-12 16:34:57 +00:00
Commented Dec 12, 2022 at 16:34

Add a comment |

terdon · Accepted Answer · 2022-12-12 11:52:24Z

I have been carrying around a couple of simple, powerful awk scripts written by a colleague years ago, that translate between fasta and "table" format. By "table" format I mean that the fasta file is converted to one entry per line, with the sequence identifier as the first tab-delimited field and the sequence as the second.

You can find the scripts here.

Using those, it is very easy to do what you want since table format is perfect for splitting:

FastaToTbl file.fa > file.tbl split -l 250 file.tbl file.subseq

That will take the input fasta file file.fa and create N files of 250 sequences, one sequence per line, named file.subseqaa, file.subseqab, ..., file.subseqaz. Now, you just need to change these back to fasta asgain:

for file in file.subseq*; do TblToFasta "$file" > "$file.fa" rm "$file" done rm file.tbl

You now have your individual fasta files and the intermediate table format files have been removed.

You can also just write your only little script to do it in one go:

gawk -v n=1 '{ if(/^>/){k++; if(k % 250 == 0){n++; }} print > "file."n".fa"; }' file.fa

Note that this might complain about too many open files if you are not using GNU awk.

Finally, your script wasn't working because you were mixing up shell and awk syntax. The $((c%250)) is shell syntax, in awk you just want if( c % 250 == 0) as I have used above.

You can change portability your gawk to awk with explicit close() — Gilles Quénot
– Gilles Quénot, Commented Dec 12, 2022 at 12:00
@GillesQuenot I know, but since these are only ~12 files, I really doubt it's worth it. — terdon
– terdon ♦, Commented Dec 12, 2022 at 12:02
@terdon that depends which awk version you're using on which platform, just google awk too many output files 10 which will mostly give you results about awk failures with less than 10 output files. Some of the results are about old awk on Solaris but some are about macOS and idk what other awks/platforms limits might be. In my mind at least, just based on questions I've seen over the years, if you have less than 10 open files you're probably fine but if you have 10 or more then YMMV. — Ed Morton
– Ed Morton, Commented Dec 12, 2022 at 15:51
In addition to the potential "too many open files" error print > "file."n".fa" is undefined behavior per POSIX and so will produce a syntax error in some awks, it needs to be print > ("file."n".fa") to work in all awks. — Ed Morton
– Ed Morton, Commented Dec 12, 2022 at 16:00
Just as another "too many open files" limit example, here's a question where nawk failed with "too many open files" at 20 - unix.com/unix-for-advanced-and-expert-users/…. I see several questions saying they hit that limit with nawk. Like I say, idk what that limit is for every awk on every platform but best I can tell if you keep it under 10 you'll be fine (I used to think under about 15 was fine till I started seeing all the questions hitting the limit at 10). — Ed Morton
– Ed Morton, Commented Dec 12, 2022 at 16:10

Stack Exchange Network

Linux - split a big file into small files according to specific character and how many it repeats

2 Answers 2

You must log in to answer this question.

Linked

Hot Network Questions

Linux - split a big file into small files according to specific character and how many it repeats

2 Answers 2

You must log in to answer this question.

Linked

Related

Hot Network Questions