
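I'd like to delete a line if it does not start with "a", "c", "t", or "g" and the next line starts with ">". In other words, a header line that is immediately followed by another header line should be removed. In the following example, ">seq3" is deleted.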

Input:

>seq1
actgatgac
>seq2
ctgacgtca
>seq3
>seq4
gtagctagt
>seq5
tgacatgca

Expected output:

>seq1
actgatgac
>seq2
ctgacgtca
>seq4
gtagctagt
>seq5
tgacatgca

I've tried with sed (sed '/^>.*/{$!N;/^>.*/!P;D}' and sed '/^>/{$d;N;/^[aA;cC;gG;tT]/!D}'), but without success.

  • Note that the fasta format allows multi-line sequences. Just mentioning this so you can check and make 100% sure all of your sequences are one line only. Also, can't your sequences have ACTG as well as actg? And what about N? Or IUPAC ambiguity codes? Commented May 12, 2020 at 14:08
  • Sure, I know. That was just an example, sequences that I created to exemplify what I want. I have a file with thousands of sequences, and some lines had just the id with no sequences, and I just wanted to delete the lines with id but with no DNA sequence. Commented May 12, 2020 at 14:31
  • Fair enough. Many people come into bioinformatics and only know about NGS and short reads and are surprised to learn that you can have multi-line sequences in both fastq and fasta. Just wanted to be on the safe side :) By the way, you might also be interested in our sister site, Bioinformatics. Commented May 12, 2020 at 15:50
  • yes, I am! Nice to know that there's a bioinfo stackexchange, I only knew biostars and seqanswers. Commented May 12, 2020 at 17:21

4 Answers


You could try something like this:

$ sed -e '$!N;/^>.*\n>/D' -e 'P;D' file
>seq1
actgatgac
>seq2
ctgacgtca
>seq4
gtagctagt
>seq5
tgacatgca

That is

  • maintain a two-line buffer with $!N ... P;D
  • look for a pattern that starts with > and has another > right after the newline
  • if it matches, delete up to the newline with D, which drops the header that has no sequence
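
The same script can be spelled out one command per line, which may make the two-line buffer easier to follow. This is only a sketch: the function name drop_empty_headers_sed is made up for illustration, and it assumes a sed that accepts comment lines inside a script (GNU sed does).

```shell
# Hypothetical wrapper name; reads from stdin or from the files given as arguments.
drop_empty_headers_sed() {
  sed '
    # Append the next input line to the pattern space (skipped on the last line).
    $!N
    # Two headers in a row: delete the first one and restart with the second.
    /^>.*\n>/D
    # Otherwise print the first buffered line ...
    P
    # ... then drop it and restart the cycle with the line that is left.
    D
  ' "$@"
}

# Example: ">seq3" has no sequence line and is dropped.
printf '>seq1\nactgatgac\n>seq3\n>seq4\ngtagctagt\n' | drop_empty_headers_sed
```

One thing to keep in mind: a header on the very last line of the file has no following line to compare against, so this approach leaves it in place.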

An awk example:

awk 'BEGIN {lasta="XXX"} {if ($0 !~ /^ *>/) printf("%s\n%s\n",lasta,$0); lasta=$0;}' fileNAME.txt 

equivalent to

cat fileNAME.txt | awk 'BEGIN {lasta="XXX"} {if ($0 !~ /^ *>/) printf("%s\n%s\n",lasta,$0); lasta=$0;}' 
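
For anyone unsure about the syntax, here is the same program written out with comments. It is a sketch rather than a replacement: the wrapper function name print_header_sequence_pairs is hypothetical, while the awk logic and the variable lasta are those of the one-liner above.

```shell
# Hypothetical wrapper name; reads from stdin or from the files given as arguments.
print_header_sequence_pairs() {
  awk '
    # Runs once before any input: seed the "previous line" with a placeholder.
    BEGIN { lasta = "XXX" }
    {
      # $0 is the current line; "!~" means "does not match the regex".
      # /^ *>/ matches a line that starts with optional spaces followed by ">".
      if ($0 !~ /^ *>/)
        printf("%s\n%s\n", lasta, $0)   # sequence line: print previous line, then this one
      lasta = $0                         # remember the current line for the next one
    }
  ' "$@"
}

# Example: ">seq3" is never printed because no sequence line follows it.
printf '>seq1\nactgatgac\n>seq3\n>seq4\ngtagctagt\n' | print_header_sequence_pairs
```

Because a header is only printed together with the sequence line that follows it, headers without a sequence are skipped. This relies on every sequence sitting on a single line: a multi-line sequence would produce repeated lines, and a file that starts with a sequence line would print the XXX placeholder.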
  • Could you explain the syntax in this awk one-liner? Commented May 15, 2020 at 11:43

If you have pcregrep installed, you could try:

pcregrep -M '^>.*\n[^>]' file 

Explanation

  • -M allows multiline matches
  • Find a pattern that begins with > and ends with a newline NOT followed by >

I tried this awk command and it works fine:

awk '{a[++i]=$0}/^[actg]/{for(x=NR-1;x<=NR;x++)print a[x]}' file.txt 

Output:

>seq1
actgatgac
>seq2
ctgacgtca
>seq4
gtagctagt
>seq5
tgacatgca
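
The command above stores every line in the array a and only recognises lowercase a, c, t, g. As a possible variant (a sketch, not the method of this answer), the header can be held back until a non-header line shows up. That way only one line is kept in memory, and uppercase bases, N, IUPAC codes and multi-line sequences pass through unchanged. The function name drop_empty_headers and the variable names are made up for illustration, and the sketch assumes the file contains no blank lines, since any line not starting with ">" is treated as sequence.

```shell
# Hypothetical wrapper name; reads from stdin or from the files given as arguments.
drop_empty_headers() {
  awk '
    # Header line: hold it back instead of printing it right away.
    /^>/ { header = $0; pending = 1; next }
    # Any other line is treated as sequence data (assumption: no blank lines).
    {
      if (pending) { print header; pending = 0 }   # release the held header once
      print                                         # then print the sequence line
    }
  ' "$@"
}

# Example with a multi-line, mixed-case record and two empty headers.
printf '>s1\nACTG\nNNac\n>s2\n>s3\ngt\n>s4\n' | drop_empty_headers
```

Unlike the two-line approaches, this also drops a header that sits on the last line of the file, because a held header is only printed once a sequence line follows it.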
