use gff2fasta instead of a bash script to get parts of DNA sequences out of a full genome

Question

EDIT and a solution

Because my original question was badly phrased and I was trying to re-invent the wheel I am answering my own question now (maybe it helps someone else):

gff2fasta is a tool which does exactly what I need, which is to extract a given piece of DNA from a full genome (A huge file in fasta format called FULLGENOME.fasta).

If I know where the piece I want is, I can make a file called TEST.gff where I specify the scaffold (here: sca_5_chr5_3_0) and the beginning (here: 2390621) and end (here: 2391041) of my piece for gff2fasta, for example:

 sca_5_chr5_3_0 JGI CDS 2390621 2391041 . + 0 name "e_gw1.5.88.1"; proteinId 40463;

Then I only need to run

 gff2fasta.pl -gff TEST.gff -fasta FULLGENOME.fasta -out test

gff2fasta will then get the info from TEST.gff and output the bit I want in the file called test, which will look like this:

 >sca_5_chr5_3_0_2390621_2391041_F ATGACGACCCGCGCACCCAAAGACACATACGCTCAGCCCGACTATGAGGAGGCTCACCTAGCGACGTTTGCAGCCCCAAA AGGCTACCCTATCGAGTCTATGCTACCCCCTAGCGTGAAGAGGGAGACCTTTGAACAGGCCCTAGCCGAGTTTACCGACG CTATCGGCAAAGATTATGTCTTTATCGGCGATGCTCTCTCTCACTACATCGATCCCTACGACATCTATATCGATGATAGT GAGAGAAGGAGGATGCCGAGCGCGGCTGTTTGGTACGTAACGCGGCACCCATCGAAACCTAGCAGCACTGACAAGTTTCC GCGTCTAGTCCCTCTTCGCTCGAGGAAGCTAAGCAGGCTCTCAAGGTTGCGAACAAATACGGCATTCCGATCTGGGCATT TTCCAGAGGCAAGAACCTGGG

Thanks to terdon and everyone else for the help!

-------------

This is the continuation of this question, with more info and details unix: get characters 10 to 80 in a file

I think I am almost there, but still need some help.

I have tried to explain it as clearly as possible, but I am aware that it is a very specialized question, so please let me know if I can clarify something further!

What I have are three files:

note to admin: is it possible to upload files? I have no clue how to...

One file (N_haematococca_1.fasta) from which I need to extract a name:

 head -1 N_haematococca_1.fasta | cut -f4 -d'|'

this name in this case is:

 e_gw1.5.88.1

Problem 1: The code above works fine but I have troubles getting the name (e_gw1.5.88.1) saved in a variable that I can use for later...

I want to save that name in a variable, let's call it:

 firstline

A second file (Necha2_best_models.gff) where I want all the lines where this name occurs:

 grep -ir "e_gw1.5.88.1" --include="Necha2*.gff" > Necha2_in_genome.list

but with the named variable:

 grep -ir $firstline --include="Necha2*.gff" > Necha2_in_genome.list

This works for the use of "e_gw1.5.88.1"...

The file I am creating here tells me the name of the DNA-fragment I want to cut (sca_5_chr5_3_0) and which bit of it I need (from character number 2390621 to character number 2392655). I need that info to get the right bits out of the third file

 a=(`awk -F '\t' '{print $4}' Necha2_in_genome.list`) startDNA=${a[1]} endDNA=${a[${#a[@]}-1]} #add 1000 or other number, depending on the problems with the gene correctedstartDNA=$(($startDNA-1000)) correctedendDNA=$(($endDNA-1000))

In a third file from which I want to cut certain parts after a keyword (sca_5_chr5_3_0) in this case:

Thanks to Kamaraj and hschou I have a partial solution to this now:

 cat Necha2_scaffolds.fasta | sed -n -e '/sca_5_chr5_3_0/,$p' | grep -v '^>' | tr -d '\n'|awk -v start="${correctedstartDNA}" -v end="${correctedendDNA}" '{print substr($0,start,end)}' RS= Necha2_scaffolds.fasta

However if I debug this with small numbers:

 cat Necha2_scaffolds.fasta | sed -n -e '/sca_5_chr5_3_0/,$p' | grep -v '^>' | tr -d '\n'|awk -v start=10 -v end=20 '{print substr($0,start,end)}' RS= Necha2_scaffolds.fasta

I get this output:

 r1_1_0 CCTTATCCTAGCG nmapped CTTATATATTAT nmapped TAAAAGGAGTTA unmapped TCTTATATAAA unmapped AATCTTAAGAA

It seem the RS option is ignored and it only prints the characters 10 to 20 for certain lines. I have no clue why these lines are selected.

 sca_5_chr5_3_0

only occurs once in the file.

other names that are there are

 >sca_66_unmapped >sca_67_unmapped

etc

I have to get this info from 178 genomes, they are all huge files and searching manually is just not an option.

HOW the files look:

N_haematococca_1.fasta (file 1) is a normal fasta file:

 >jgi|Necha2|40463|e_gw1.5.88.1 MTTRAPKDTYAQPDYEEAHLATFAAPKGYPIESMLPPSVKRETFEQALAEFTDAIGKDYVFIGDALSHYI DPYDIYIDDSERRRMPSAAVCPSSLEEAKQALKVANKYGIPIWAFSRGKNLGYGGPSARVNGSVAFDLHR

Necha2_best_models.gff (file 2) looks like this (just longer):

 sca_5_chr5_3_0 JGI exon 2390621 2390892 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2390621 2390892 . + 0 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 1 sca_5_chr5_3_0 JGI start_codon 2390621 2390623 . + 0 name "e_gw1.5.88.1" sca_5_chr5_3_0 JGI exon 2390949 2391041 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2390949 2391041 . + 2 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 2 sca_5_chr5_3_0 JGI exon 2391104 2391311 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2391104 2391311 . + 2 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 3 sca_5_chr5_3_0 JGI exon 2391380 2392367 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2391380 2392367 . + 0 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 4 sca_5_chr5_3_0 JGI exon 2392421 2392485 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2392421 2392485 . + 1 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 5 sca_5_chr5_3_0 JGI exon 2392541 2392657 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2392541 2392657 . + 0 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 6 sca_5_chr5_3_0 JGI stop_codon 2392655 2392657 . + 0 name "e_gw1.5.88.1" sca_5_chr5_3_0 JGI exon 2396205 2396730 . + . name "e_gw1.5.227.1"; transcriptId 41333 sca_5_chr5_3_0 JGI CDS 2396205 2396730 . + 0 name "e_gw1.5.227.1"; proteinId 41333; exonNumber 1 sca_5_chr5_3_0 JGI start_codon 2396205 2396207 . + 0 name "e_gw1.5.227.1"

Necha2_scaffolds.fasta (file 3) looks a bit like this (much longer GATC bits...):

 >sca_8_chr1_1_0 CCTTATCCTAGCGAGGATTAAGGTCCTCGAAAAGAAAGGCAAAGATATAAAGGTAATATAATAGAAGATT AAGGTATTCTAAGTAAAGGTTATAAAGAAATAAAATAAGAAGAATATTTATAGGCTAAGAAAGACCCCCC TAAAGGTTAAGGACTTAATATTAAGATTTAATATTCCCTAATTAATTAATATATTAATAAAAATAAAGAT >sca_5_chr5_3_0 ATGACCACTATATCCATCGGCACAACGGCGTTAATCACATTTGGGTCTGCAATTTTCTGTTTTTGCGGTT TTCATGTCGGTTCCAACGGGCGGAGCTCGACCAAGAATTACATACATTACGTGGCGCGAGCGCTCTACCC TCTGGGACATGAAGGGGACGAGGAGTCTGGAGAGGTTCATGTTTGGAATCACAGCATGGTACTGCGGCAC CCTACCTTGGTCATTCGTTGGTCTCTTTTTTTCAGGGTATGGACGCAGACTCTCAGACTCGCTCTGCTTT >sca_67_unmapped CTACAGGAAGCCCTAGAGGCCCTGCAAGACCTTCCAAAGCAGCACTCTTTGCTTCTTCTTAGAGGCAGTA TCCAGCTTCTTCTAAGGCACCTCCAGAGGCAGCTAGACCCAACCGGGCTAAAAGACCTTTGGGAAGAGGC TAATACCCTTATAAGAGAGGCTATTATAGCCCTAGTGGCTAGAAGCCCTAGTGAGAGCCCTAAAGAGCCT AATTCAAGCCTTATAGCCCTTCCAGTCAGAGAGGGAGGCCTAGGAATACCCTTACACAAGGACCTAGCCC

Expected final output: Is a single bit of text after the >sca_5_chr5_3_0 keyword

 TTCATGTCGGTTCCAACGGGCGGAGCTCGACCAAGAATTACATACATTACGTGGCGCGAGCGCTCTACCC TCTGGGACATGAAGGGGACGAGGAGTCTGGAGAGGTTCATGTTTGGAATCACAGCATGGTACTGCGGCAC CCTACCTTGGTCATTCGTTGGTCTCTTTTTTTCAGGGTATGGACGCAGACTCTCAG

You got bad advice/aren't using a version of cut with a -z option. — Michael Homer
– Michael Homer, Commented Apr 7, 2017 at 8:31
what is the contents of Necha2_scaffolds.fasta and what is expected output ? — Kamaraj
– Kamaraj, Commented Apr 7, 2017 at 8:34
awk -v start="${correctedstartDNA}" -v end="${correctedendDNA}" '{print substr($0,start,end)}' RS= Necha2_scaffolds.fasta — Kamaraj
– Kamaraj, Commented Apr 7, 2017 at 8:48
Please edit your question and show us a minimal example of all files. Most people will have no idea what a gff file looks like. Show us a small gff and a small fasta file and the output you would expect from them. Also explain whether you can have sequences in the - strand so we know whether our answers will need to deal with that as well. Oh and also explain that a fasta sequence is split into multiple lines, this is essential for your answer since when looking for a character range, we will need to concatenate the sequence lines. — terdon
– terdon ♦, Commented Apr 7, 2017 at 9:04
As a general rule, it is almost never a good idea to use grep, sed and awk in a single pipeline, since sed can do everything grep can, and awk can do everything sed can. — Michael Vehrs
– Michael Vehrs, Commented Apr 7, 2017 at 9:13

Stack Exchange Network

use gff2fasta instead of a bash script to get parts of DNA sequences out of a full genome

-------------

0

You must log in to answer this question.

Linked

Hot Network Questions

use gff2fasta instead of a bash script to get parts of DNA sequences out of a full genome

-------------

0

You must log in to answer this question.

Linked

Related

Hot Network Questions