Delete line if next line starts with pattern

Question

I'd like to delete a line if it does not start with "a", "c", "t" or "g" and the next line starts with '>'. In the following example, ">seq3" is deleted.

Input:

>seq1
actgatgac
>seq2
ctgacgtca
>seq3
>seq4
gtagctagt
>seq5
tgacatgca

Expected output:

>seq1
actgatgac
>seq2
ctgacgtca
>seq4
gtagctagt
>seq5
tgacatgca

I've tried sed (sed '/^>.*/{$!N;/^>.*/!P;D}' and sed '/^>/{$d;N;/^[aA;cC;gG;tT]/!D}'), but without success.

Note that the fasta format allows multi-line sequences. Just mentioning this so you can check and make 100% sure all of your sequences are one line only. Also, can't your sequences have ACTG as well as actg? And what about N? Or IUPAC ambiguity codes?
– terdon ♦, Commented May 12, 2020 at 14:08
Sure, I know. That was just an example, sequences that I created to exemplify what I want. I have a file with thousands of sequences, and some lines had just the id with no sequences, and I just wanted to delete the lines with id but with no DNA sequence.
– Tiago Minuzzi, Commented May 12, 2020 at 14:31
Fair enough. Many people come into bioinformatics and only know about NGS and short reads and are surprised to learn that you can have multi-line sequences in both fastq and fasta. Just wanted to be on the safe side :) By the way, you might also be interested in our sister site, Bioinformatics.
– terdon ♦, Commented May 12, 2020 at 15:50
Yes, I am! Nice to know that there's a bioinfo stackexchange, I only knew biostars and seqanswers.
– Tiago Minuzzi, Commented May 12, 2020 at 17:21

steeldriver · Accepted Answer · 2020-05-12 13:50:42Z

You could try something like this:

$ sed -e '$!N;/^>.*\n>/D' -e 'P;D' file
>seq1
actgatgac
>seq2
ctgacgtca
>seq4
gtagctagt
>seq5
tgacatgca

That is:

maintain a two-line buffer with $!N ... P;D
look for a pattern that starts with > and has another > right after the newline
if it matches, delete up to the newline (dropping the first header) and start over with what is left
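The steps above can be sketched as a multi-line sed script with one comment per command. This is only an illustration run on the sample data from the question; the variable names `input` and `result` are made up for the example.

```shell
# Sample input from the question, one record per line.
input=$(printf '%s\n' '>seq1' actgatgac '>seq2' ctgacgtca '>seq3' \
                      '>seq4' gtagctagt '>seq5' tgacatgca)

result=$(printf '%s\n' "$input" | sed '
# On every line except the last, append the next input line to the buffer.
$!N
# Two headers in a row: delete the first one and restart with the second.
/^>.*\n>/D
# Otherwise print the first line of the buffer ...
P
# ... then remove it and restart with the remaining line.
D
')

printf '%s\n' "$result"
```

Note that with this script a header on the very last line of the file is kept, because `$!N` has no following line to pair it with.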

AdminBee · Accepted Answer · 2020-05-15 07:13:57Z

An awk example:

awk 'BEGIN {lasta="XXX"} {if ($0 !~ /^ *>/) printf("%s\n%s\n",lasta,$0); lasta=$0;}' fileNAME.txt

equivalent to

cat fileNAME.txt | awk 'BEGIN {lasta="XXX"} {if ($0 !~ /^ *>/) printf("%s\n%s\n",lasta,$0); lasta=$0;}'

Could you explain the syntax in this awk one-liner?
– Tiago Minuzzi, Commented May 15, 2020 at 11:43
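As a rough reading of the one-liner (not the answerer's own explanation), here is the same program spread over several lines with comments, run on the sample data from the question. The shell variables `input` and `result` are made up for the example.

```shell
# Sample input from the question, one record per line.
input=$(printf '%s\n' '>seq1' actgatgac '>seq2' ctgacgtca '>seq3' \
                      '>seq4' gtagctagt '>seq5' tgacatgca)

result=$(printf '%s\n' "$input" | awk '
  # Before any input is read: placeholder for "no previous line yet".
  BEGIN { lasta = "XXX" }
  {
    # $0 is the current line. The regex matches optional spaces followed
    # by ">", so the condition is true for every line that is NOT a header.
    if ($0 !~ /^ *>/)
      printf("%s\n%s\n", lasta, $0)   # previous line, then current line
    # Remember the current line for the next iteration.
    lasta = $0
  }
')

printf '%s\n' "$result"
```

A header is therefore only printed once a non-header line follows it, which is why ">seq3" disappears. Two consequences of this logic: if the first line of the file is not a header, the placeholder "XXX" is printed, and with multi-line sequences the previous sequence line would be printed again before each continuation line.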

enharmonic · Accepted Answer · 2020-05-13 20:51:09Z

If you have pcregrep installed, you could try:

pcregrep -M '^>.*\n[^>]' file

Explanation

-M allows multiline matches
Find a pattern that begins with > and ends with a newline NOT followed by >
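A small sketch of the command on the sample data from the question, assuming pcregrep is on the PATH (the check and the variable names `input` and `result` are only there for the example):

```shell
# Sample input from the question, one record per line.
input=$(printf '%s\n' '>seq1' actgatgac '>seq2' ctgacgtca '>seq3' \
                      '>seq4' gtagctagt '>seq5' tgacatgca)

if command -v pcregrep >/dev/null 2>&1; then
  # -M lets one match span the header line and the start of the next line;
  # every line touched by a match is printed in full.
  result=$(printf '%s\n' "$input" | pcregrep -M '^>.*\n[^>]')
  printf '%s\n' "$result"
else
  result=""
  echo "pcregrep not found; skipping" >&2
fi
```

Because the pattern only selects a header plus the line directly after it, any further lines of a multi-line sequence would not be printed, so this fits single-line sequences only.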

Praveen Kumar BS · Accepted Answer · 2020-05-17 09:58:37Z

I tried the following awk command and it works:

awk '{a[++i]=$0}/^[actg]/{for(x=NR-1;x<=NR;x++)print a[x]}' file.txt

Output:

>seq1
actgatgac
>seq2
ctgacgtca
>seq4
gtagctagt
>seq5
tgacatgca
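The command above stores every line of the file in the array a. As a sketch of a variant (not the answerer's code), the following keeps only the most recent header in memory and prints it once a non-header line follows. The awk variable `header` and the shell variables `input` and `result` are made up for the example, and the script assumes that every line not starting with ">" is sequence data.

```shell
# Sample input from the question, one record per line.
input=$(printf '%s\n' '>seq1' actgatgac '>seq2' ctgacgtca '>seq3' \
                      '>seq4' gtagctagt '>seq5' tgacatgca)

result=$(printf '%s\n' "$input" | awk '
  # Header line: remember it (replacing any pending one), print nothing yet.
  /^>/ { header = $0; next }
  # Any other line: emit the pending header once, then the line itself.
  {
    if (header != "") { print header; header = "" }
    print
  }
')

printf '%s\n' "$result"
```

Under these assumptions the sketch also copes with the multi-line sequences mentioned in the comments on the question, and it drops a header that is the last line of the file.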
