2

I have a cluster fasta file (called file) which looks like:

>1AB2 >1AB2 AA NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI NCIDHFR8EHGBVPIWOBGIGRI >1AB3 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >1SC4 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >2CD5 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >2AC6 >2AC6 AA NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU NFIROUHBOERVERUGBERUOVREOIBROEBVUE NVHIRE >2ONM AA BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR NEBIBVVBRU >2POD AA BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP BUEIBVEO >7KZL >7KZL AA BUIREBVAUREVBREOIRGPNJBFDVERUBVROR >6HG3 >6GH3 AA NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI FHUIERBLUUIREB >6GH4 AA BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH NFHILRUGAURHG 

the about file has 4 groups: 1AB2, 2AC6, 7KZL, and 6GH3. the content during the first >1AB2 and the first >2AC6 belongs to the cluster 1AB2. the content during the first >2AC6 and the first >7KZL belongs to the cluster 2AC6.

I want to separate the file into 4 files at the second >XXXX. each file should look like:

file_1

>1AB2 AA NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI NCIDHFR8EHGBVPIWOBGIGRI >1AB3 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >1SC4 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >2CD5 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN 

file_2

>2AC6 AA NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU NFIROUHBOERVERUGBERUOVREOIBROEBVUE NVHIRE >2ONM AA BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR NEBIBVVBRU >2POD AA BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP BUEIBVEO 

file_3

>7KZL AA BUIREBVAUREVBREOIRGPNJBFDVERUBVROR 

file_4

>6GH3 AA NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI FHUIERBLUUIREB >6GH4 AA BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH NFHILRUGAURHG 
0

3 Answers 3

4
awk '/^>/ && NF==1 {close(out); out="file_"++n; next} {print > out}' file 

Based on your test input, the header, where you want to change the output file, is defined as: the row starting with > and having only one field. Using next we print nothing for this line, but set the output filename. Also a close() call ensures we will not end with too many files open as awk could raise an error for that.


Output:

$ head file_* ==> file_1 <== >1AB2 AA NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI NCIDHFR8EHGBVPIWOBGIGRI >1AB3 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >1SC4 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >2CD5 AA ==> file_2 <== >2AC6 AA NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU NFIROUHBOERVERUGBERUOVREOIBROEBVUE NVHIRE >2ONM AA BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR NEBIBVVBRU >2POD AA BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP ==> file_3 <== >7KZL AA BUIREBVAUREVBREOIRGPNJBFDVERUBVROR ==> file_4 <== >6GH3 AA NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI FHUIERBLUUIREB >6GH4 AA BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH NFHILRUGAURHG thanasis@basis:~/Documents/development/temp> ``` 
0
1

You can use csplit:

csplit --prefix file_ --elide-empty-files --suppress-matched file '/^>....$/' '{*}' 

It creates 4 files, named file_00 to _03 with the content that you need.

0
1

using awk+sed combo:

awk -v f="wfile_" ' /^>/ && length==5 { if (a++) print p, ",", NR-1, f a-1 p=NR+1 } END {print p, ",$" f a}' < file | split -l 10 for f in x*; do sed -nf "$f" file done 

We are using awk to determine the line number of the block initiators /^>.{4}$/ and then construct the appropriate sed code

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.