awk/sed: split a cluster file into multiple files

Question

I have a cluster fasta file (called file) which looks like:

>1AB2
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2AC6
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO
>7KZL
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
>6HG3
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

The above file has 4 groups: 1AB2, 2AC6, 7KZL, and 6GH3. The content between the first >1AB2 and the first >2AC6 belongs to cluster 1AB2. The content between the first >2AC6 and the first >7KZL belongs to cluster 2AC6.

I want to split the file into 4 files, each starting at the second >XXXX line of its group. Each file should look like:

file_1

>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN

file_2

>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO

file_3

>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR

file_4

>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

thanasisp · Accepted Answer · 2022-05-16 16:45:45Z

awk '/^>/ && NF==1 {close(out); out="file_"++n; next} {print > out}' file

Based on your test input, a header (the point where the output file should change) is a row that starts with > and has only one field. On such a row the output filename is set and next skips printing, so the header itself is not written. The close() call closes the previous output file first, so awk does not hit a too-many-open-files error when there are many groups.

Output:

$ head file_*
==> file_1 <==
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA

==> file_2 <==
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP

==> file_3 <==
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR

==> file_4 <==
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG
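The command can be tried end to end on a shortened version of the sample. This is only a minimal sketch for checking the behaviour: the scratch directory and the trimmed two-cluster input are assumptions made for illustration, not part of the original data.

```shell
# Work in a scratch directory so no existing file_N is overwritten.
cd "$(mktemp -d)" || exit 1

# Shortened sample: two clusters, each introduced by a one-field ">ID" line.
cat > file <<'EOF'
>1AB2
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
>2AC6
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
EOF

# Same command as above: a one-field ">" line switches to the next output file
# and is itself skipped; every other line is appended to the current file.
awk '/^>/ && NF==1 {close(out); out="file_"++n; next} {print > out}' file

# Show how many lines ended up in each piece.
wc -l file_1 file_2
```

Note that the command relies on the very first line being a one-field header. If other lines could come before it, out would still be empty when the first print runs, so such input would need a guard.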

schrodingerscatcuriosity · Accepted Answer · 2022-05-16 16:57:42Z

You can use csplit:

csplit --prefix file_ --elide-empty-files --suppress-matched file '/^>....$/' '{*}'

It creates 4 files, named file_00 to file_03, with the content that you need.
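csplit numbers its output from 00, while the question asks for file_1 to file_4. One possible follow-up is to renumber the pieces afterwards. The sketch below is not part of the answer: it assumes GNU csplit, a shortened hypothetical sample, and a directory with no other files matching file_NN. The --quiet flag is added only to silence the byte counts csplit normally prints.

```shell
# Work in a scratch directory so the glob below only sees csplit's output.
cd "$(mktemp -d)" || exit 1

# Shortened sample: two clusters, each introduced by a 5-character ">XXXX" line.
cat > file <<'EOF'
>1AB2
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
>2AC6
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
EOF

# Split at each ">XXXX" line, drop the matched line, skip the empty first piece.
csplit --quiet --prefix file_ --elide-empty-files --suppress-matched file '/^>....$/' '{*}'

# Rename file_00, file_01, ... to file_1, file_2, ... (glob order is sorted).
n=0
for f in file_[0-9][0-9]; do
  n=$((n + 1))
  mv -- "$f" "file_$n"
done
```

The rename cannot collide with its own input, because the new one-digit names do not match the two-digit glob.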

guest_7 · Accepted Answer · 2022-05-20 07:46:20Z

Using an awk+sed combo:

awk -v f="wfile_" '
/^>/ && length==5 {
  if (a++) print p, ",", NR-1, f a-1
  p=NR+1
}
END {print p, ",$" f a}' < file | split -l 10
for f in x*; do
  sed -nf "$f" file
done

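We use awk to find the line numbers of the block initiators, /^>.{4}$/, and then construct the appropriate sed code: one line-range w (write) command per block. split -l 10 breaks that sed script into pieces of 10 commands each, and the loop runs sed once per piece.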
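To make the intermediate step visible, the awk stage can be run on its own. The sketch below uses a shortened, hypothetical sample and only saves and prints the generated sed commands. The name script.sed is an assumption for illustration, and sed itself is not run.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)" || exit 1

# Shortened sample: block initiators are on lines 1 and 6.
cat > file <<'EOF'
>1AB2
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
>2AC6
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
EOF

# The awk stage from the answer, with its output saved instead of piped to split.
awk -v f="wfile_" '
/^>/ && length==5 {
  if (a++) print p, ",", NR-1, f a-1   # emit the range of the previous block
  p=NR+1                                # first content line of the new block
}
END {print p, ",$" f a}' < file > script.sed

cat script.sed
```

For this input the script contains two commands, `2 , 5 wfile_1` and `7 ,$wfile_2`. Read as sed code, they write lines 2 to 5 to file_1 and line 7 through the last line to file_2. The spaces around the comma and the missing space after w are accepted by GNU sed, but other sed implementations may be stricter.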
