For the simple example you show, where all sequences fit on a single line, you could just use grep (if your grep doesn't support the --no-group-separator option, pass the output through grep -v -- '--'):
    $ grep -wEA1 --no-group-separator 'chr1|chr2|chr21|chrX' file.fa
    >chr1
    ACGGTGTAGTCG
    >chr2
    ACGTGTATAGCT
    >chr21
    ACGTTGATGAAA
    >chrX
    GTACGGGGGTGG
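If you have multi-line sequences (which seems likely if you're dealing with chromosomes), the grep approach breaks, because -A1 only prints the first line after each header. You either need to join each sequence onto one line first, or treat each whole record as a unit. This complicates things considerably. For the latter, you could use an awk one-liner: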
    $ awk -vRS=">" 'BEGIN{t["chr1"]=1;t["chr2"]=1;t["chr21"]=1;t["chrX"]=1}
                    {if($1 in t){printf ">%s",$0}}' file.fa
    >chr1
    ACGGTGTAGTCG
    >chr2
    ACGTGTATAGCT
    >chr21
    ACGTTGATGAAA
    >chrX
    GTACGGGGGTGG
The script above sets the record separator to > (-vRS=">"). This means that "lines" are defined by > and not \n, so each "line" is one complete FASTA record: the header followed by all of its sequence lines. Then, the BEGIN block sets up an array where each of the target sequence IDs is a key. The rest simply checks each "line" (sequence) and, if the 1st field is in the array ($1 in t), it prints the current "line" ($0) preceded by a >, since the > was consumed as the separator.
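If the list of IDs gets long, the array can be filled from a file instead of being typed out by hand. The following is only a minimal sketch of that idea: the function name extract_by_ids is made up for illustration, and it assumes the IDs file holds one ID per line, with the ID being the first whitespace-separated word.

```shell
# extract_by_ids IDS_FILE FASTA_FILE
# Hypothetical helper: prints the FASTA records whose ID (the first word of
# the header) is listed in IDS_FILE. Works with multi-line sequences.
extract_by_ids() {
    awk -v idfile="$1" '
        BEGIN {
            # Read the IDs while RS is still the default newline.
            while ((getline line < idfile) > 0) {
                n = split(line, f, " ")
                if (n > 0) t[f[1]] = 1      # skip blank lines
            }
            close(idfile)
            # From here on, one record is one whole FASTA entry.
            RS = ">"
        }
        # $1 is the sequence ID; put back the ">" that RS consumed.
        ($1 in t) { printf ">%s", $0 }
    ' "$2"
}
```

You would then call it as extract_by_ids ids.txt file.fa. Like the one-liner above, it assumes that > never appears anywhere except at the start of a header line.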
If you're going to be doing this sort of thing often, writing out the array by hand will quickly become annoying. Personally, I use the two scripts below, which I inherited from a former lab mate, to convert FASTA to tbl format (sequence_name<TAB>sequence, one sequence per line) and back:
FastaToTbl
    #!/usr/bin/awk -f
    {
        if (substr($1,1,1)==">")
            if (NR>1)
                printf "\n%s\t", substr($0,2,length($0)-1)
            else
                printf "%s\t", substr($0,2,length($0)-1)
        else
            printf "%s", $0
    }
    END{printf "\n"}
TblToFasta
    #! /usr/bin/awk -f
    {
        sequence=$NF

        ls = length(sequence)
        is = 1
        fld = 1
        while (fld < NF)
        {
            if (fld == 1){printf ">"}
            printf "%s " , $fld

            if (fld == NF-1)
            {
                printf "\n"
            }
            fld = fld+1
        }
        while (is <= ls)
        {
            printf "%s\n", substr(sequence,is,60)
            is=is+60
        }
    }
If you save those in a directory that is in your $PATH and make them executable, you can simply grep for your target sequences (and this will work for multi-line sequences, unlike the plain grep approach at the top):
    $ FastaToTbl file.fa | grep -wE 'chr1|chr2|chr21|chrX' | TblToFasta
    >chr1
    ACGGTGTAGTCG
    >chr2
    ACGTGTATAGCT
    >chr21
    ACGTTGATGAAA
    >chrX
    GTACGGGGGTGG
This is much easier to extend since you can pass grep a file of search targets:
    $ cat ids.txt
    chr1
    chr2
    chr21
    chrX

    $ FastaToTbl file.fa | grep -wFf ids.txt | TblToFasta
    >chr1
    ACGGTGTAGTCG
    >chr2
    ACGTGTATAGCT
    >chr21
    ACGTTGATGAAA
    >chrX
    GTACGGGGGTGG
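One thing to keep in mind is that grep -w matches a whole word anywhere on the line, so a header whose description happens to mention chrX would also be selected. If that matters for your data, you can restrict the match to the ID itself. This is just a sketch, assuming tbl-format input on standard input as produced by FastaToTbl above. The name filter_tbl_by_id is hypothetical.

```shell
# filter_tbl_by_id IDS_FILE
# Hypothetical helper: reads tbl-format lines (name<TAB>sequence) on stdin
# and prints only those whose ID, i.e. the first word of the name, is
# listed in IDS_FILE (one ID per line).
filter_tbl_by_id() {
    awk -F '\t' -v idfile="$1" '
        BEGIN {
            while ((getline line < idfile) > 0) {
                n = split(line, f, " ")
                if (n > 0) ids[f[1]] = 1    # skip blank lines
            }
            close(idfile)
        }
        {
            # $1 is the full header text; its first word is the ID.
            split($1, name, " ")
            if (name[1] in ids) print
        }
    '
}
```

It drops into the same pipeline in place of grep: FastaToTbl file.fa | filter_tbl_by_id ids.txt | TblToFasta.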
Finally, if you will be working with large sequences, you might consider getting one of the various tools that can do this sort of thing for you. They will be far faster and more efficient in the long run. One example is fastafetch from the exonerate suite, which is available in the repos of most Linux distributions. On Debian-based systems, for example, you can install it with sudo apt-get install exonerate. Once you've installed it, you can do:
    ## Create the index
    fastaindex -f file.fa -i file.in

    ## Loop and retrieve your sequences
    for seq in chr1 chr2 chr21 chrX; do
        fastafetch -f file.fa -i file.in -q "$seq"
    done
    >chr1
    ACGGTGTAGTCG
    >chr2
    ACGTGTATAGCT
    >chr21
    ACGTTGATGAAA
    >chrX
    GTACGGGGGTGG
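If your IDs live in a file rather than on the command line, the same loop can read them from there. This is a rough sketch that only uses the fastafetch options shown above. The function name fetch_ids is made up, and it assumes the index has already been created with fastaindex.

```shell
# fetch_ids FASTA_FILE INDEX_FILE IDS_FILE
# Hypothetical helper: runs fastafetch once for every non-empty line of
# IDS_FILE (one ID per line).
fetch_ids() {
    fasta=$1
    index=$2
    ids=$3
    while IFS= read -r seq; do
        # Skip blank lines in the IDs file.
        [ -n "$seq" ] || continue
        fastafetch -f "$fasta" -i "$index" -q "$seq"
    done < "$ids"
}
```

For example: fetch_ids file.fa file.in ids.txt > selected.fa (selected.fa is just an example output name).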
Alternatively, you can use my own retrieveseqs.pl, which has a few other nifty functions:
    $ retrieveseqs.pl -h

        retrieveseqs.pl will take one or more lists of ids and extract their sequences from
        multi FASTA file

        USAGE : retrieveseqs.pl [-viofsn] <FASTA sequence file> <desired IDs, one per line>

        -v : verbose output, print a progress indicator (a "." for every 1000 sequences processed)
        -V : as above but a "!" for every desired sequence found.
        -f : fast, takes first characters of name "(/^([^\s]*)/)" given until the first space as the search string
             make SURE that those chars are UNIQUE.
        -i : use when the ids in the id file are EXACTLY identical to those in the FASTA file
        -h : Show this help and exit.
        -o : will create one fasta file for each of the id files
        -s : will create one fasta file per id
        -n : means that the last arguments (after the sequence file) passed are a QUOTED list of the names desired.
        -u : assumes uniprot format ids (separated by |)
In your case, you would do:
    $ retrieveseqs.pl -fn file.fa "chr1 chr2 chr21 chrX"
    [7 (4/4 found]
    >chr1
    ACGGTGTAGTCG
    >chr2
    ACGTGTATAGCT
    >chr21
    ACGTTGATGAAA
    >chrX
    GTACGGGGGTGG
Note that this was something I wrote for my own work and it isn't very well documented or supported. Still, I've been using it happily for years.