Extracting subset from fasta file

Question

I have a fasta file which looks like this:

>chr1 ACGGTGTAGTCG >chr2 ACGTGTATAGCT >chrUn ACGTGGATATTT >chr21 ACGTTGATGAAA >chrX GTACGGGGGTGG >chrUn5 TGATAGCTGTTG

I just want to extract chr1, chr2, chr21, chrX along with their sequences. So the output I want is:

>chr1 ACGGTGTAGTCG >chr2 ACGTGTATAGCT >chr21 ACGTTGATGAAA >chrX GTACGGGGGTGG

How can I do it in unix command line?

Is that example representative or do you have the entire chromosome sequence there? The solutions will be completely different for one-line sequences and for multiline sequences. — terdon
– terdon ♦, Commented Jan 5, 2016 at 22:42

terdon · Accepted Answer · 2016-01-06 00:56:04Z

For the simple example you show, where all sequences fit on a single line, you could just use grep (if your grep doesn't support the --no-group-separator option, pass the output through grep -v -- '--'):

$ grep -wEA1 --no-group-separator 'chr1|chr2|chr21|chrX' file.fa >chr1 ACGGTGTAGTCG >chr2 ACGTGTATAGCT >chr21 ACGTTGATGAAA >chrX GTACGGGGGTGG

Assuming you have multi-line sequences (which seems likely if you're dealing with chromosomes), you will need to concatenate them into one line first. This complicates things considerably. You could use an awk one-liner:

$ awk -vRS=">" 'BEGIN{t["chr1"]=1;t["chr2"]=1;t["chr21"]=1;t["chrX"]=1} {if($1 in t){printf ">%s",$0}}' file.fa >chr1 ACGGTGTAGTCG >chr2 ACGTGTATAGCT >chr21 ACGTTGATGAAA >chrX GTACGGGGGTGG

The script above sets the record separator to > (vRS=">"). This means that "lines" are defined by >~ and not \n. Then, the BEGIN block sets up an array where each of the target sequence IDs is a key. The rest simply checks each "line" (sequence) and, if the 1st field is in the array ($i in t), it prints the current "line" ($0) preceded by a >.

If you're going to be doing this sort of thing often, writing up the array will quickly become annoying. Personally, I use the two scripts below, which I inherited from a former lab mate, to convert FASTA to tbl format (sequence_name<TAB>sequence,one sequence per line) and back:

FastaToTbl

#!/usr/bin/awk -f { if (substr($1,1,1)==">") if (NR>1) printf "\n%s\t", substr($0,2,length($0)-1) else printf "%s\t", substr($0,2,length($0)-1) else printf "%s", $0 }END{printf "\n"}

TblToFasta

#! /usr/bin/awk -f { sequence=$NF ls = length(sequence) is = 1 fld = 1 while (fld < NF) { if (fld == 1){printf ">"} printf "%s " , $fld if (fld == NF-1){ printf "\n" } fld = fld+1 } while (is <= ls){ printf "%s\n", substr(sequence,is,60) is=is+60 } }

If you save those in your $PATH and make them executable, you can simply grep for your target sequences (and this will work for multi-line sequences, unlike the above):

$ FastaToTbl file.fa | grep -wE 'chr1|chr2|chr21|chrX' | TblToFasta >chr1 ACGGTGTAGTCG >chr2 ACGTGTATAGCT >chr21 ACGTTGATGAAA >chrX GTACGGGGGTGG

This is much easier to extend since you can pass grep a file of search targets:

$ cat ids.txt chr1 chr2 chr21 chrX $ FastaToTbl file.fa | grep -wFf ids.txt | TblToFasta >chr1 ACGGTGTAGTCG >chr2 ACGTGTATAGCT >chr21 ACGTTGATGAAA >chrX GTACGGGGGTGG

Finally, if you will be working with large sequences, you might consider getting one of the various tools that can do this sort of thing for you. They will be far faster and more efficient in the long run. For example, fastafetch from the exonerate suite. It is available in the repos of most Linux distributions. On Debian based systems you can install it with sudo apt-get install exonerate, for example. Once you've installed it, you can do:

## Create the index fastaindex -f file.fa -i file.in ## Loop and retrieve your sequences for seq in chr1 chr2 chr21 chrX; do fastafetch -f file.fa -i file.in -q "$seq" done >chr1 ACGGTGTAGTCG >chr2 ACGTGTATAGCT >chr21 ACGTTGATGAAA >chrX GTACGGGGGTGG

Alternatively, you can use my own retrieveseqs.pl, which has a few other nifty functions:

$ retrieveseqs.pl -h retrieveseqs.pl will take one or more lists of ids and extract their sequences from multi FASTA file USAGE : retrieveseqs.pl [-viofsn] <FASTA sequence file> <desired IDs, one per line> -v : verbose output, print a progress indicator (a "." for every 1000 sequences processed) -V : as above but a "!" for every desired sequence found. -f : fast, takes first characters of name "(/^([^\s]*)/)" given until the first space as the search string make SURE that those chars are UNIQUE. -i : use when the ids in the id file are EXACTLY identical to those in the FASTA file -h : Show this help and exit. -o : will create one fasta file for each of the id files -s : will create one fasta file per id -n : means that the last arguments (after the sequence file) passed are a QUOTED list of the names desired. -u : assumes uniprot format ids (separated by |)

In your case, you would do:

$ retrieveseqs.pl -fn file.fa "chr1 chr2 chr21 chrX" [7 (4/4 found] >chr1 ACGGTGTAGTCG >chr2 ACGTGTATAGCT >chr21 ACGTTGATGAAA >chrX GTACGGGGGTGG

Note that this was something I wrote for my own work and it isn't very well documented or supported. Still, I've been using it happily for years.

Recent versions of GNU grep have a --no-group-separator context line option that does away with the need for the additional grep -v -- '--' when using grep -An — steeldriver
– steeldriver, Commented Jan 5, 2016 at 23:51
@steeldriver ah, thanks. I actually read through man grep before posting since I remembered something like that but didn't manage to find it. It doesn't seem to be mentioned anywhere in my Arch's man grep (only in the info page). — terdon
– terdon ♦, Commented Jan 5, 2016 at 23:53
Yes same on my Ubuntu box (14.04 w/ grep v.2.16) - only in the info page — steeldriver
– steeldriver, Commented Jan 5, 2016 at 23:56
In fastafetch in exonerate version 2.4.0 there is -F option which reads IDs from file. — boczniak767
– boczniak767, Commented Jun 12, 2023 at 21:23

chaos · Accepted Answer · 2016-01-05 22:17:43Z

With sed:

sed -n '/^>chr1$\|^>chr2$\|^>chr21$\|^>chrX$/{p;n;p}' file

-n suppresses automatic output.
/.../ the regular expression to match >chr1, >chr2, >chr21 or >chrX.
{p;n;p} if the expression matches, print the line, read the next input line to pattern space, and print that line too.

If it must be awk, it's nearly the same mechanism:

awk '/^>chr1$|^>chr2$|^>chr21$|^>chrX$/{print;getline;print;}' file

The OP hasn't mentioned this, but FASTA files often (usually) have more than a single line under the >id. Your solution works perfectly for the OP's example (so +1), but will probably fail on their actual data. — terdon
– terdon ♦, Commented Jan 5, 2016 at 23:17

Sheri Sanders · Accepted Answer · 2019-05-22 18:31:45Z

For anyone finding this, the retrieveseqs.pl is great, but slower than a similar program I have used for years by John Nash. I cannot find it online, but we have it on our github (https://github.com/NCGAS/Useful-Scripts/blob/master/subset_fasta.pl)

[xxxxx@h1 test2]$ time ../retrieveseqs.pl cdna.cds.predupclean cds.list >tmp [3926 (3925/3925 found] real 0m17.622s user 0m17.102s sys 0m0.045s [xxxxx@h1 test2]$ time ../subset_fasta.pl -i cdna.list < cdna.cds.predupclean > tmp real 0m3.111s user 0m2.987s sys 0m0.032s

EDIT: out of curiosity, I also tested the TableToFasta method, which seems to be faster than both above:

[xxxxx@c71 test2]$ time ./FastaToTable cdna.cds.predupclean | grep -wFf <(sed 's/>//g' cdna.list) | ./TableToFasta > tmp real 0m0.189s user 0m0.222s sys 0m0.047s

To test the speed of the retrieveseqs.pl script, you should use the -f flag since that speeds it up enormously. — terdon
– terdon ♦, Commented Jul 4, 2020 at 12:47

Stack Exchange Network

Extracting subset from fasta file

3 Answers 3

You must log in to answer this question.

Linked

Hot Network Questions

Extracting subset from fasta file

3 Answers 3

You must log in to answer this question.

Linked

Related

Hot Network Questions