How to split multifasta based on partial fasta header

Question

I have download a batch of refseq fasta files and want to split them based on strain. this is complicated by accession numbers, project IDs etc. I downloaded the genomes via eDirect from NCBI. I know I can split the output but I want them together for some preliminary parsing (remove ambiguous and plasmid sequences etc.)

See example below.

>NZ_AESB01000237.1 Clostridium botulinum BKT028387 contig00237 ATGC >NZ_AESB01000236.1 Clostridium botulinum BKT028387 contig00236 ATGC >NZ_ALNK01000035.1 Peptoanaerobacter stomatis strain OBRC8 ctg120007290634 ATGC >NZ_ALNK01000031.1 Peptoanaerobacter stomatis strain OBRC8 ctg120007290635 ATGC

I want two files: Clostridium botulinum BKT028387.fas and Peptoanaerobacter stomatis OBRC8.fas

My actual dataset is much larger (~70,000 genomes) and not too sure about the headers in each case.

Im thinking the following could work but Im not sure how to execute it:

Using a grep command to extract the lines but split the output.

If I download the summary list from refseq and use the names as the query example:

grep -Fwf refseq.list sequences.fa

Summary file

Peptoanaerobacter stomatis strain OBRC8, whole genome shotgun sequencing project 7,810,226 rc other DNA NZ_AJXR00000000.1 GI:938474759

Clostridium botulinum BKT028387, whole genome shotgun sequencing project 4,134,593 rc other DNA NZ_NOJY00000000.1 GI:1231817869

But I could edit it down to

Peptoanaerobacter stomatis strain OBRC8

Clostridium botulinum BKT028387

Am I one the right track. How do I finish this?

What operating system are you using? I asked you in your previous question as well but you didn't reply. The tools available to you and their behavior change depending on your OS. — terdon
– terdon ♦, Commented Dec 8, 2017 at 10:14
Unix? Do you mean an actual Unix or macOS or Linux? Please tell us your actual OS, a unix-like system can be assumed if you're doing bioinformatics usually. — terdon
– terdon ♦, Commented Dec 8, 2017 at 10:18
Great, then you should have the GNU tools. Good. Have a look at the updated answer. What does the summary list look like? — terdon
– terdon ♦, Commented Dec 8, 2017 at 10:25
Ah! That summary file can simplify things enormously! Can you link to the entire file? Does it have all the strains and does it have each strain in exactly the same format as it appears in the fasta file header? I am afraid it won't since I've already found inconsistencies. It might be simpler to start from the beginning and download each strain separately. — terdon
– terdon ♦, Commented Dec 8, 2017 at 10:38

Steve Bond · Accepted Answer · 2017-12-14 04:00:24Z

The SeqBuddy --pull_records command can grab sequences based on a regular expression against record IDs and/or metadata. Given your example:

$: seqbuddy.py sequences.fa --pull_records "Clostridium botulinum.*BKT028387" full > Clostridium_botulinum_BKT028387.fas $: seqbuddy.py sequences.fa --pull_records "Peptoanaerobacter stomatis.*OBRC8" full > Peptoanaerobacter_stomatis_OBRC8.fas

This looks great. Is there a way to set this up to loop using an input list of names? — AudileF
– AudileF, Commented Dec 15, 2017 at 11:54

Nils Giordano · Accepted Answer · 2018-03-29 15:50:22Z

I had a similar problem and found a solution using awk. Since this is the first time I use awk, the syntax might not be perfect, but you should get an idea of what is possible.

awk -F' ' '{if(/^>/){sub("strain ", ""); name=$2" "$3" "$4; print > name".fa"}else{print > name".fa"}}' yourfile.fa

(more information below)

This should transform a file like this:

yourfile.fa

>NZ_AESB01000237.1 Clostridium botulinum BKT028387 contig00237 ATGC ATGCATGC ATGCATGC >NZ_AESB01000236.1 Clostridium botulinum BKT028387 contig00236 ATGC ATGCATGC >NZ_ALNK01000035.1 Peptoanaerobacter stomatis strain OBRC8 ctg120007290634 ATGC ATGCATGC ATGCATGC >NZ_ALNK01000031.1 Peptoanaerobacter stomatis strain OBRC8 ctg120007290635 ATGC ATGCATGC

Into two files like this:

'Clostridium botulinum BKT028387.fa'

>NZ_AESB01000237.1 Clostridium botulinum BKT028387 contig00237 ATGC ATGCATGC ATGCATGC >NZ_AESB01000236.1 Clostridium botulinum BKT028387 contig00236 ATGC ATGCATGC

'Peptoanaerobacter stomatis OBRC8.fa'

>NZ_ALNK01000035.1 Peptoanaerobacter stomatis OBRC8 ctg120007290634 ATGC ATGCATGC ATGCATGC >NZ_ALNK01000031.1 Peptoanaerobacter stomatis OBRC8 ctg120007290635 ATGC ATGCATGC

How it works

-F' ' tells awk to split the line by space (accessible with $1, $2, ...)
if(/^>/) looks for line starting with the > character, then sub("strain ", "") remove the word strain and name=$2" "$3" "$4; print > name".fa" print the line into a new file with a name based on what you asked for.
else if the line does not start with >, {print > name".fa"} appends it to a file using the current value of name.

This solution will work regardless of the number of lines between the fasta headers. Note that I won't recommend using spaces in filenames, so I would replace name=$2" "$3" "$4 with something like name=$2"_"$3"_"$4. I used this solution to split a large fasta file (>5 GB) containing >5k genomes with >25k scaffolds and I found it fast and rather memory efficient.

Thank you so much for the code and explanation. However, when I tried it, it gave me the following error. Any ideas? awk: syntax error at source line 1 context is {if(/^>/){ name=$2" "$3" "$4; print > >>> name".fa" <<< awk: illegal statement at source line 1 awk: syntax error at source line 1 — user1995
– user1995, Commented Sep 3, 2021 at 17:45
I'm not sure about your error but you can probably put the code on several lines for easier debugging, and execute it with the awk -f option like this: awk -f myscript.awk mygenomes.fasta. — Nils Giordano
– Nils Giordano, Commented Sep 4, 2021 at 19:15

terdon · Accepted Answer · 2017-12-08 10:24:54Z

I would convert the fasta file to a format with one line per record (I have already posted the FastaToTbl and TblToFasta scripts I use for this before), and then look for a specific pattern:

FastaToTbl file.fa | grep 'BKT028387' | TblToFasta > clostridium.fa FastaToTbl file.fa | grep 'OBRC8' | TblToFasta > peptoanaerobacter.fa

I have downloaded a few sequences using the same command you described in your previous question, and it looks like the headers are not standardized. For example, while you have many that are in the format of $species strain $strain, like these:

Clostridioides difficile strain 7496-D/ST9 Clostridioides difficile strain 5616-DH/ST42, Peptoanaerobacter stomatis strain OBRC8 Romboutsia maritimum strain CCRI-22766 Tepidibacter mesophilus strain JCM

Others don't have the word strain and just give the strain:

Peptostreptococcus anaerobius 653-L contig00088, Peptoclostridium acidaminophilum DSM 3953

So, as a first step, you could deal with the ones that do have the word strain as the 4th word of the header line (the 1st is the seqID which I have not included in the examples above) and the actual strain as the 5th and get those out of the way:

FastaToTbl foo.fa | awk '($4=="strain"){gsub("/","_"); print $0 >> $2"_"$3"_"$5".tbl"}'

That will look for lines whose 4th field is strain, then it will replace any / in the line with _ (there's a strain called CF/ST37 and that cannot be in a filename) and print into a file whose name is the 2nd, 3rd and 5th fields of the line concatenated. On my test file, this produced the following files (as you can see, some are wrong since there is a comma after the strain, but that's because they're not standardized):

$ ls *tbl Clostridioides_difficile_5479-NonSp_ST229,.tbl Clostridioides_difficile_6589-DH_ST42.tbl Clostridioides_difficile_7457-NonSp_novelST,.tbl Clostridioides_difficile_5479-NonSp_ST229.tbl Clostridioides_difficile_6592-G_ST8,.tbl Clostridioides_difficile_7457-NonSp_novelST.tbl Clostridioides_difficile_5480-E_ST6,.tbl Clostridioides_difficile_6592-G_ST8.tbl Clostridioides_difficile_7458-N_ST10,.tbl Clostridioides_difficile_5480-E_ST6.tbl Clostridioides_difficile_6593-Y_ST110,.tbl Clostridioides_difficile_7458-N_ST10.tbl Clostridioides_difficile_5482-L_ST55,.tbl Clostridioides_difficile_6593-Y_ST110.tbl Clostridioides_difficile_7459-G_ST8,.tbl Clostridioides_difficile_5482-L_ST55.tbl Clostridioides_difficile_6595-AL_ST58,.tbl Clostridioides_difficile_7459-G_ST8.tbl Clostridioides_difficile_5485-Y_ST2,.tbl Clostridioides_difficile_6595-AL_ST58.tbl Clostridioides_difficile_7462-DH_ST42,.tbl Clostridioides_difficile_5485-Y_ST2.tbl Clostridioides_difficile_6596-D_ST9,.tbl Clostridioides_difficile_7462-DH_ST42.tbl Clostridioides_difficile_5488-DH_ST42,.tbl Clostridioides_difficile_6596-D_ST9.tbl Clostridioides_difficile_7463-E_ST6,.tbl Clostridioides_difficile_5488-DH_ST42.tbl Clostridioides_difficile_6599-NonSp_ST103,.tbl Clostridioides_difficile_7463-E_ST6.tbl Clostridioides_difficile_5489-M_ST15,.tbl Clostridioides_difficile_6599-NonSp_ST103.tbl Clostridioides_difficile_7464-Y_ST110,.tbl Clostridioides_difficile_5489-M_ST15.tbl Clostridioides_difficile_6600-G_ST8,.tbl Clostridioides_difficile_7464-Y_ST110.tbl Clostridioides_difficile_5490-CF_ST37,.tbl Clostridioides_difficile_6600-G_ST8.tbl Clostridioides_difficile_7466-DH_ST42,.tbl Clostridioides_difficile_5490-CF_ST37.tbl Clostridioides_difficile_6606-BK_ST11,.tbl Clostridioides_difficile_7466-DH_ST42.tbl Clostridioides_difficile_5493-DH_ST42,.tbl Clostridioides_difficile_6606-BK_ST11.tbl Clostridioides_difficile_7467-DH_ST42,.tbl Clostridioides_difficile_5493-DH_ST42.tbl Clostridioides_difficile_6607-DH_ST42,.tbl Clostridioides_difficile_7467-DH_ST42.tbl Clostridioides_difficile_5495-AF_ST41,.tbl Clostridioides_difficile_6607-DH_ST42.tbl Clostridioides_difficile_7468-NonSp_ST97,.tbl Clostridioides_difficile_5495-AF_ST41.tbl Clostridioides_difficile_6610-N_ST10,.tbl Clostridioides_difficile_7468-NonSp_ST97.tbl Clostridioides_difficile_5497-Y_ST2,.tbl Clostridioides_difficile_6610-N_ST10.tbl Clostridioides_difficile_7471-BI_ST1,.tbl Clostridioides_difficile_5497-Y_ST2.tbl Clostridioides_difficile_6612-DH_ST28,.tbl Clostridioides_difficile_7471-BI_ST1.tbl Clostridioides_difficile_5500-DH_ST42,.tbl Clostridioides_difficile_6612-DH_ST28.tbl Clostridioides_difficile_7472-Y_ST2,.tbl Clostridioides_difficile_5500-DH_ST42.tbl Clostridioides_difficile_6613-NonSp_ST3,.tbl Clostridioides_difficile_7472-Y_ST2.tbl Clostridioides_difficile_5504-L_ST55,.tbl Clostridioides_difficile_6613-NonSp_ST3.tbl Clostridioides_difficile_7473-DH_ST42,.tbl Clostridioides_difficile_5504-L_ST55.tbl Clostridioides_difficile_6614-AL_ST34,.tbl Clostridioides_difficile_7473-DH_ST42.tbl Clostridioides_difficile_5505-Y_ST2,.tbl Clostridioides_difficile_6614-AL_ST34.tbl Clostridioides_difficile_7476-Y_ST2,.tbl Clostridioides_difficile_5505-Y_ST2.tbl Clostridioides_difficile_6615-NonSp_ST11,.tbl Clostridioides_difficile_7476-Y_ST2.tbl Clostridioides_difficile_5537-D_ST9,.tbl Clostridioides_difficile_6615-NonSp_ST11.tbl Clostridioides_difficile_7478-NonSp_novelST,.tbl Clostridioides_difficile_5537-D_ST9.tbl Clostridioides_difficile_6616-NonSp_novelST,.tbl Clostridioides_difficile_7478-NonSp_novelST.tbl Clostridioides_difficile_5538-BK_ST11,.tbl Clostridioides_difficile_6616-NonSp_novelST.tbl Clostridioides_difficile_7479-NonSp_ST3,.tbl Clostridioides_difficile_5538-BK_ST11.tbl Clostridioides_difficile_6617-A_ST43,.tbl Clostridioides_difficile_7479-NonSp_ST3.tbl Clostridioides_difficile_5539-Y_ST2,.tbl Clostridioides_difficile_6617-A_ST43.tbl Clostridioides_difficile_7480-M_ST15,.tbl Clostridioides_difficile_5539-Y_ST2.tbl Clostridioides_difficile_6618-NonSp_ST53,.tbl Clostridioides_difficile_7480-M_ST15.tbl Clostridioides_difficile_5541-CF_ST37,.tbl Clostridioides_difficile_6618-NonSp_ST53.tbl Clostridioides_difficile_7481-NonSp_ST59,.tbl Clostridioides_difficile_5541-CF_ST37.tbl Clostridioides_difficile_6619-CF_ST37,.tbl Clostridioides_difficile_7481-NonSp_ST59.tbl Clostridioides_difficile_5542-Y_ST2,.tbl Clostridioides_difficile_6619-CF_ST37.tbl Clostridioides_difficile_7482-DH_ST42,.tbl Clostridioides_difficile_5542-Y_ST2.tbl Clostridioides_difficile_6621-NonSp_ST10,.tbl Clostridioides_difficile_7482-DH_ST42.tbl Clostridioides_difficile_5547-DH_ST42,.tbl Clostridioides_difficile_6621-NonSp_ST10.tbl Clostridioides_difficile_7486-A_ST43,.tbl Clostridioides_difficile_5547-DH_ST42.tbl Clostridioides_difficile_6622-CF_ST37,.tbl Clostridioides_difficile_7486-A_ST43.tbl Clostridioides_difficile_5548-Y_ST2,.tbl Clostridioides_difficile_6622-CF_ST37.tbl Clostridioides_difficile_7488-NonSp_ST103,.tbl Clostridioides_difficile_5548-Y_ST2.tbl Clostridioides_difficile_6623-NonSp_ST54,.tbl Clostridioides_difficile_7488-NonSp_ST103.tbl Clostridioides_difficile_5549-NonSp_ST21,.tbl Clostridioides_difficile_6623-NonSp_ST54.tbl Clostridioides_difficile_7491-DH_ST42,.tbl Clostridioides_difficile_5549-NonSp_ST21.tbl Clostridioides_difficile_6625-Y_ST2,.tbl Clostridioides_difficile_7491-DH_ST42.tbl Clostridioides_difficile_5550-AH_ST67,.tbl Clostridioides_difficile_6625-Y_ST2.tbl Clostridioides_difficile_7492-CF_ST37,.tbl Clostridioides_difficile_5550-AH_ST67.tbl Clostridioides_difficile_6626-Y_ST110,.tbl Clostridioides_difficile_7492-CF_ST37.tbl Clostridioides_difficile_5551-CF_ST37,.tbl Clostridioides_difficile_6626-Y_ST110.tbl Clostridioides_difficile_7494-E_ST6,.tbl Clostridioides_difficile_5551-CF_ST37.tbl Clostridioides_difficile_6628-AL_ST58,.tbl Clostridioides_difficile_7494-E_ST6.tbl Clostridioides_difficile_5552-DH_ST42,.tbl Clostridioides_difficile_6628-AL_ST58.tbl Clostridioides_difficile_7496-D_ST9,.tbl Clostridioides_difficile_5552-DH_ST42.tbl Clostridioides_difficile_6629-NonSp_novelST,.tbl Clostridioides_difficile_7496-D_ST9.tbl Clostridioides_difficile_5553-L_ST55,.tbl Clostridioides_difficile_6629-NonSp_novelST.tbl Clostridioides_difficile_7497-NonSp_novelST,.tbl Clostridioides_difficile_5553-L_ST55.tbl Clostridioides_difficile_6631-DH_ST42,.tbl Clostridioides_difficile_7497-NonSp_novelST.tbl Clostridioides_difficile_5555-DH_ST42,.tbl Clostridioides_difficile_6631-DH_ST42.tbl Clostridioides_difficile_7498-M_ST15,.tbl Clostridioides_difficile_5555-DH_ST42.tbl Clostridioides_difficile_6632-NonSp_ST3,.tbl Clostridioides_difficile_7498-M_ST15.tbl Clostridioides_difficile_5557-DH_ST42,.tbl Clostridioides_difficile_6632-NonSp_ST3.tbl Clostridioides_difficile_7499-CF_ST37,.tbl Clostridioides_difficile_5557-DH_ST42.tbl Clostridioides_difficile_6634-DH_ST42,.tbl Clostridioides_difficile_7499-CF_ST37.tbl Clostridioides_difficile_5558-AH_ST67,.tbl Clostridioides_difficile_6634-DH_ST42.tbl Clostridioides_difficile_7501-BI_ST1,.tbl Clostridioides_difficile_5558-AH_ST67.tbl Clostridioides_difficile_6635-BI_ST1,.tbl Clostridioides_difficile_7501-BI_ST1.tbl Clostridioides_difficile_5559-DH_ST42,.tbl Clostridioides_difficile_6635-BI_ST1.tbl Clostridioides_difficile_7503-Y_ST2,.tbl Clostridioides_difficile_5559-DH_ST42.tbl Clostridioides_difficile_6637-AL_ST58,.tbl Clostridioides_difficile_7503-Y_ST2.tbl Clostridioides_difficile_5573-CF_ST37,.tbl Clostridioides_difficile_6637-AL_ST58.tbl Clostridioides_difficile_7505-Y_ST2,.tbl Clostridioides_difficile_5573-CF_ST37.tbl Clostridioides_difficile_7450-DH_ST42,.tbl Clostridioides_difficile_7505-Y_ST2.tbl Clostridioides_difficile_5616-DH_ST42,.tbl Clostridioides_difficile_7450-DH_ST42.tbl Clostridioides_difficile_NCKUH-21,.tbl Clostridioides_difficile_5616-DH_ST42.tbl Clostridioides_difficile_7451-NonSp_ST2,.tbl Clostridioides_difficile_RC10,.tbl Clostridioides_difficile_5754,.tbl Clostridioides_difficile_7451-NonSp_ST2.tbl Clostridioides_difficile_RC10.tbl Clostridioides_difficile_5754.tbl Clostridioides_difficile_7452-Y_ST110,.tbl Clostridioides_difficile_RF17,.tbl Clostridioides_difficile_6583-AH_ST67,.tbl Clostridioides_difficile_7452-Y_ST110.tbl Clostridioides_difficile_RF17.tbl Clostridioides_difficile_6583-AH_ST67.tbl Clostridioides_difficile_7453-Y_ST2,.tbl Peptoanaerobacter_stomatis_OBRC8,.tbl Clostridioides_difficile_6584-AQ_ST26,.tbl Clostridioides_difficile_7453-Y_ST2.tbl Peptoanaerobacter_stomatis_OBRC8.tbl Clostridioides_difficile_6584-AQ_ST26.tbl Clostridioides_difficile_7455-A_ST43,.tbl Romboutsia_maritimum_CCRI-22766,.tbl Clostridioides_difficile_6588-G_ST8,.tbl Clostridioides_difficile_7455-A_ST43.tbl Romboutsia_maritimum_CCRI-22766.tbl Clostridioides_difficile_6588-G_ST8.tbl Clostridioides_difficile_7456-DH_ST42,.tbl Tepidibacter_mesophilus_JCM.tbl Clostridioides_difficile_6589-DH_ST42,.tbl Clostridioides_difficile_7456-DH_ST42.tbl

Finally, you can convert these back to fasta with:

for f in *tbl; do TblToFasta $f > ${f/.tbl/.fa} && rm $f; done

This will only get those sequences whose naming scheme is as described above. But that's a start. You can then see what else you have and extract the rest manually or by finding a similar approach that matches those.

@AudileF we can't know that unless we know the details of your sequences. it all depends on how the strains are defined. I assume this is about the sequences you downloaded using edirect from your other question but that is the type of thing you should mention in your question. It's only luck that I happen to have read the other one. I am trying to give you a more general solution but the names of the few sequences I downloaded are not standard so it will not be easy. — terdon
– terdon ♦, Commented Dec 8, 2017 at 10:06

Stack Exchange Network

How to split multifasta based on partial fasta header

3 Answers 3

Linked

Hot Network Questions

How to split multifasta based on partial fasta header

3 Answers 3

Linked

Related

Hot Network Questions