Use another file to extract part of a line that matches with grep, as well as the following line, then save to new file

Question

I have a file that has a DNA sequence identifier in one line and the DNA sequence in the next line right below it. The DNA sequence is long but it is in one line.

File1.fasta:

>AB244308.1.1447 233_28379 1..292 
-----------------------------------------------------------------------------------------------------------------------------------------------------GTGCCAG-C-C-G-C-CGC-GGTAATAC-GG-AGGAT-GCG-A-GCG-TTATC-CGG-ATTCATT-GG-GT-TTA--AAGGGTGCGCAGG-C-G-G-GCGT-A-T------------------------------------AA----G-T-C-A-----------------------------------------------------G-G-G--G--TG--A-AA-TG--C-C-AC-G-G---------------------------------------------------------------------------------------------------------------------------------------CT-C-AA----------------------------------------------------------------------------------------------------------------------------------------------------------------C-C-G-T-G-G-A--A-C----T-G--C-C---T--T----------------------------T--GA-T-A---C----------------------------------------------------T--G-T--AT--G-T-C----------------------------------------------------------------------------------------------------------------------------------T-T-G-A-G-T--T-----T-AG------TT-G-A---------------------A-G-T-G---GG-C---------------------------------------------------------------------------------------------------------------------------------------GG--A--ATG------------------------------------------------------------------------------------------------------------------------------------T-A-G-C-AT--GT-A-G-CG-GT--G--------------A--A-A---------------------------------------------------------------------------------------------------TG-C-AT-AG--AG-A-TG-------------------------------C-T------A-C------A-G-A-AC-A-CC------------------------------------------------GA--T--A--GC-GAA-G--G-C----A--------G--C-T-C-A---CTA---------A--GT-T-A-----------------------------------------------------------------------------------------------------------------------------------------A-G--------A-C-T--GA--CG-----C---------------------------------------------TC--A-TG--C-A-CG-A--AA-G-C----G-TG--GG-G-AT-C-A-AA-CA--GG-AT--------TA-G-ATA--------CC-C-C-C-GTA--GT-C-C------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

There's about 112,000 sequences in this file that follow that format. I have about 20 sequence identifiers that I'd like to pull from the fasta file and save to another file.

The sequence identifiers are in a txt file like this:

File2.txt:

AB244308.1.1447
New.ReferenceOTU151
New.CleanUp.ReferenceOTU19
New.ReferenceOTU59
New.CleanUp.ReferenceOTU6

In addition to pulling lines with the sequence identifiers, I'd like to pull the following line with the DNA sequence as well and print all of this to a new text file.

I've found through this answer (How to extract lines from a text file that contains strings from a list in another file?) that I would need to use grep and sed. I have also found another answer (https://stackoverflow.com/questions/7103531/how-to-get-the-part-of-file-after-the-line-that-matches-grep-expression-first) relevant to getting the line after the grep match.

Unfortunately, I am unsure how to proceed in combining these answers to get what I want.

Can you create a simpler input/output example to illustrate your problem? For example, assume the input file contains the numbers 1 to 10 (on separate lines) and another file contains the numbers 3 and 7. Do you want 3, 4, 7, 8 as output? Also, do you have GNU grep or some other version?
– Sundeep, Commented Mar 20, 2017 at 7:05

user218374 · Accepted Answer · 2017-03-20 09:31:02Z

As they say, there's more than one way to skin a cat:

grep -F -f File2.txt -A 1 File1.fasta > File3.log

< File2.txt sed -e 's|[.]|\\&|g; s|.*|g/^>&/.,.+1W File3.log|' | ed -s - File1.fasta

In the sed/ed variant, we turn the sequence identifiers into an ed batch script generated on the fly. That script is then passed to ed, which runs it against your fasta file and appends the results to File3.log.
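To see what ed actually receives, it can help to run the sed stage on its own. A minimal sketch, assuming File2.txt holds one identifier per line (the two identifiers are taken from the question; the temporary directory and the variable name script are only for the demo):

```shell
# Work in a throwaway directory so nothing real is overwritten.
cd "$(mktemp -d)"
printf '%s\n' 'AB244308.1.1447' 'New.ReferenceOTU151' > File2.txt

# Stage 1: escape every dot so ed treats it literally.
# Stage 2: wrap each identifier in an ed "g/regex/command" line.
script=$(sed -e 's|[.]|\\&|g; s|.*|g/^>&/.,.+1W File3.log|' < File2.txt)
printf '%s\n' "$script"
```

Each generated line tells ed to find the header that starts with > followed by the identifier, and to append (W) that line and the one after it to File3.log.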

Community · Accepted Answer · 2017-04-13 12:36:56Z

If your sequences are always on a single line (which is not the standard fasta format, by the way; fasta normally has 60 characters per line), this is trivial. Just use grep with -A 1 to print the matching line and the next one, and -f to feed it a list of patterns to search for:

grep -A1 -f File2.txt File1.fasta
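A small self-contained run of that command, as a sketch on made-up records (id1, id2 and id3 are hypothetical). It adds two things that are not in the command above: -F, so the dots in identifiers are matched literally, and --no-group-separator, a GNU grep option that suppresses the -- lines grep otherwise prints between non-adjacent matches. The output name File3.fasta is chosen here only for illustration.

```shell
# Demo data in a throwaway directory: three single-line records.
cd "$(mktemp -d)"
printf '%s\n' '>id1 first' 'AAAA' '>id2 second' 'CCCC' '>id3 third' 'GGGG' > File1.fasta
printf '%s\n' 'id1' 'id3' > File2.txt

# -f reads one pattern per line, -A1 adds the line after each match,
# -F disables regex interpretation, --no-group-separator assumes GNU grep.
grep -F -A1 --no-group-separator -f File2.txt File1.fasta > File3.fasta
cat File3.fasta
```

Without GNU grep, dropping --no-group-separator still works, but the output then contains -- lines between records.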

However, this will fail if you have one sequence called >foobar and another named >foo and you search for foo. It will print both in that case. For more sophisticated solutions, see my answer here. Let me know if you want the retrievesqs.pl script, it's no longer available at the link there. I'll need to update that answer.
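One way to avoid the >foo versus >foobar problem is to compare the whole identifier rather than a substring. This is a sketch in awk, not the method from the answer referred to above. It assumes one identifier per line in File2.txt, and that the identifier is the first whitespace-delimited word after > in each header. The demo records are made up.

```shell
# Demo data in a throwaway directory: "foo" is a prefix of "foobar".
cd "$(mktemp -d)"
printf '%s\n' '>foo first' 'AAAA' '>foobar second' 'CCCC' '>baz third' 'GGGG' > File1.fasta
printf '%s\n' 'foo' > File2.txt

awk '
  # First file: remember every wanted identifier.
  NR == FNR { want[$1]; next }
  # Header line: strip the leading ">" and test for an exact match.
  /^>/ { keep = (substr($1, 2) in want) }
  # Print the header and all following lines of a kept record.
  keep
' File2.txt File1.fasta > File3.fasta
cat File3.fasta
```

Because it keeps printing until the next header, this also works when a sequence is wrapped over several lines.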

Philippos · Accepted Answer · 2017-03-20 09:05:51Z

Is there a txt file for each of the 20 identifiers? Then assuming they are called sequence1.txt and so on (please adapt), do

for file in sequence*.txt; do
  id=$(grep AB "$file")
  grep -A1 -- "$id" File1.fasta | grep -v -- "$id"
done

The second line assumes that the id always contains AB. If it does not, but the id is always on the first line, use head -1 $file instead.

The third line extracts the id line and the line that follows it. The second grep then removes the id line. You can drop that second grep if you want the id line printed along with the sequence, so you know which sequence belongs to which id.

With the additional grep option -m1 you can speed up the search a bit, because you know there is only one match in the file.
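If the identifiers are in a single file instead, the same idea can be written as a loop over its lines. This is a sketch under two assumptions: File2.txt has one identifier per line, and grep supports -m (GNU and BSD grep do, but it is not in POSIX). The records are made up. Note that -m1 stops at the first header containing the identifier, so an identifier that is a prefix of another can still pick the wrong record.

```shell
# Demo data in a throwaway directory.
cd "$(mktemp -d)"
printf '%s\n' '>seqA one' 'AAAA' '>seqB two' 'CCCC' '>seqC three' 'GGGG' > File1.fasta
printf '%s\n' 'seqC' 'seqA' > File2.txt

# For each identifier, print the first matching header plus the next line.
while IFS= read -r id; do
  [ -n "$id" ] || continue            # skip empty lines
  grep -m1 -A1 -F -- ">$id" File1.fasta
done < File2.txt > File3.fasta
cat File3.fasta
```

The records come out in the order of File2.txt, not in the order they appear in the fasta file.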
