2

EDIT and a solution

Because my original question was badly phrased and I was trying to re-invent the wheel I am answering my own question now (maybe it helps someone else):

gff2fasta is a tool which does exactly what I need, which is to extract a given piece of DNA from a full genome (A huge file in fasta format called FULLGENOME.fasta).

If I know where the piece I want is, I can make a file called TEST.gff where I specify the scaffold (here: sca_5_chr5_3_0) and the beginning (here: 2390621) and end (here: 2391041) of my piece for gff2fasta, for example:

 sca_5_chr5_3_0 JGI CDS 2390621 2391041 . + 0 name "e_gw1.5.88.1"; proteinId 40463; 

Then I only need to run

 gff2fasta.pl -gff TEST.gff -fasta FULLGENOME.fasta -out test 

gff2fasta will then get the info from TEST.gff and output the bit I want in the file called test, which will look like this:

 >sca_5_chr5_3_0_2390621_2391041_F ATGACGACCCGCGCACCCAAAGACACATACGCTCAGCCCGACTATGAGGAGGCTCACCTAGCGACGTTTGCAGCCCCAAA AGGCTACCCTATCGAGTCTATGCTACCCCCTAGCGTGAAGAGGGAGACCTTTGAACAGGCCCTAGCCGAGTTTACCGACG CTATCGGCAAAGATTATGTCTTTATCGGCGATGCTCTCTCTCACTACATCGATCCCTACGACATCTATATCGATGATAGT GAGAGAAGGAGGATGCCGAGCGCGGCTGTTTGGTACGTAACGCGGCACCCATCGAAACCTAGCAGCACTGACAAGTTTCC GCGTCTAGTCCCTCTTCGCTCGAGGAAGCTAAGCAGGCTCTCAAGGTTGCGAACAAATACGGCATTCCGATCTGGGCATT TTCCAGAGGCAAGAACCTGGG 

Thanks to terdon and everyone else for the help!

-------------

This is the continuation of this question, with more info and details unix: get characters 10 to 80 in a file

I think I am almost there, but still need some help.

I have tried to explain it as clearly as possible, but I am aware that it is a very specialized question, so please let me know if I can clarify something further!

What I have are three files:

note to admin: is it possible to upload files? I have no clue how to...

One file (N_haematococca_1.fasta) from which I need to extract a name:

 head -1 N_haematococca_1.fasta | cut -f4 -d'|' 

this name in this case is:

 e_gw1.5.88.1 

Problem 1: The code above works fine but I have troubles getting the name (e_gw1.5.88.1) saved in a variable that I can use for later...

I want to save that name in a variable, let's call it:

 firstline 

A second file (Necha2_best_models.gff) where I want all the lines where this name occurs:

 grep -ir "e_gw1.5.88.1" --include="Necha2*.gff" > Necha2_in_genome.list 

but with the named variable:

 grep -ir $firstline --include="Necha2*.gff" > Necha2_in_genome.list 

This works for the use of "e_gw1.5.88.1"...

The file I am creating here tells me the name of the DNA-fragment I want to cut (sca_5_chr5_3_0) and which bit of it I need (from character number 2390621 to character number 2392655). I need that info to get the right bits out of the third file

 a=(`awk -F '\t' '{print $4}' Necha2_in_genome.list`) startDNA=${a[1]} endDNA=${a[${#a[@]}-1]} #add 1000 or other number, depending on the problems with the gene correctedstartDNA=$(($startDNA-1000)) correctedendDNA=$(($endDNA-1000)) 

In a third file from which I want to cut certain parts after a keyword (sca_5_chr5_3_0) in this case:

Thanks to Kamaraj and hschou I have a partial solution to this now:

 cat Necha2_scaffolds.fasta | sed -n -e '/sca_5_chr5_3_0/,$p' | grep -v '^>' | tr -d '\n'|awk -v start="${correctedstartDNA}" -v end="${correctedendDNA}" '{print substr($0,start,end)}' RS= Necha2_scaffolds.fasta 

However if I debug this with small numbers:

 cat Necha2_scaffolds.fasta | sed -n -e '/sca_5_chr5_3_0/,$p' | grep -v '^>' | tr -d '\n'|awk -v start=10 -v end=20 '{print substr($0,start,end)}' RS= Necha2_scaffolds.fasta 

I get this output:

 r1_1_0 CCTTATCCTAGCG nmapped CTTATATATTAT nmapped TAAAAGGAGTTA unmapped TCTTATATAAA unmapped AATCTTAAGAA 

It seem the RS option is ignored and it only prints the characters 10 to 20 for certain lines. I have no clue why these lines are selected.

 sca_5_chr5_3_0 

only occurs once in the file.

other names that are there are

 >sca_66_unmapped >sca_67_unmapped 

etc

I have to get this info from 178 genomes, they are all huge files and searching manually is just not an option.

HOW the files look:

N_haematococca_1.fasta (file 1) is a normal fasta file:

 >jgi|Necha2|40463|e_gw1.5.88.1 MTTRAPKDTYAQPDYEEAHLATFAAPKGYPIESMLPPSVKRETFEQALAEFTDAIGKDYVFIGDALSHYI DPYDIYIDDSERRRMPSAAVCPSSLEEAKQALKVANKYGIPIWAFSRGKNLGYGGPSARVNGSVAFDLHR 

Necha2_best_models.gff (file 2) looks like this (just longer):

 sca_5_chr5_3_0 JGI exon 2390621 2390892 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2390621 2390892 . + 0 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 1 sca_5_chr5_3_0 JGI start_codon 2390621 2390623 . + 0 name "e_gw1.5.88.1" sca_5_chr5_3_0 JGI exon 2390949 2391041 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2390949 2391041 . + 2 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 2 sca_5_chr5_3_0 JGI exon 2391104 2391311 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2391104 2391311 . + 2 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 3 sca_5_chr5_3_0 JGI exon 2391380 2392367 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2391380 2392367 . + 0 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 4 sca_5_chr5_3_0 JGI exon 2392421 2392485 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2392421 2392485 . + 1 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 5 sca_5_chr5_3_0 JGI exon 2392541 2392657 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2392541 2392657 . + 0 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 6 sca_5_chr5_3_0 JGI stop_codon 2392655 2392657 . + 0 name "e_gw1.5.88.1" sca_5_chr5_3_0 JGI exon 2396205 2396730 . + . name "e_gw1.5.227.1"; transcriptId 41333 sca_5_chr5_3_0 JGI CDS 2396205 2396730 . + 0 name "e_gw1.5.227.1"; proteinId 41333; exonNumber 1 sca_5_chr5_3_0 JGI start_codon 2396205 2396207 . + 0 name "e_gw1.5.227.1" 

Necha2_scaffolds.fasta (file 3) looks a bit like this (much longer GATC bits...):

 >sca_8_chr1_1_0 CCTTATCCTAGCGAGGATTAAGGTCCTCGAAAAGAAAGGCAAAGATATAAAGGTAATATAATAGAAGATT AAGGTATTCTAAGTAAAGGTTATAAAGAAATAAAATAAGAAGAATATTTATAGGCTAAGAAAGACCCCCC TAAAGGTTAAGGACTTAATATTAAGATTTAATATTCCCTAATTAATTAATATATTAATAAAAATAAAGAT >sca_5_chr5_3_0 ATGACCACTATATCCATCGGCACAACGGCGTTAATCACATTTGGGTCTGCAATTTTCTGTTTTTGCGGTT TTCATGTCGGTTCCAACGGGCGGAGCTCGACCAAGAATTACATACATTACGTGGCGCGAGCGCTCTACCC TCTGGGACATGAAGGGGACGAGGAGTCTGGAGAGGTTCATGTTTGGAATCACAGCATGGTACTGCGGCAC CCTACCTTGGTCATTCGTTGGTCTCTTTTTTTCAGGGTATGGACGCAGACTCTCAGACTCGCTCTGCTTT >sca_67_unmapped CTACAGGAAGCCCTAGAGGCCCTGCAAGACCTTCCAAAGCAGCACTCTTTGCTTCTTCTTAGAGGCAGTA TCCAGCTTCTTCTAAGGCACCTCCAGAGGCAGCTAGACCCAACCGGGCTAAAAGACCTTTGGGAAGAGGC TAATACCCTTATAAGAGAGGCTATTATAGCCCTAGTGGCTAGAAGCCCTAGTGAGAGCCCTAAAGAGCCT AATTCAAGCCTTATAGCCCTTCCAGTCAGAGAGGGAGGCCTAGGAATACCCTTACACAAGGACCTAGCCC 

Expected final output: Is a single bit of text after the >sca_5_chr5_3_0 keyword

 TTCATGTCGGTTCCAACGGGCGGAGCTCGACCAAGAATTACATACATTACGTGGCGCGAGCGCTCTACCC TCTGGGACATGAAGGGGACGAGGAGTCTGGAGAGGTTCATGTTTGGAATCACAGCATGGTACTGCGGCAC CCTACCTTGGTCATTCGTTGGTCTCTTTTTTTCAGGGTATGGACGCAGACTCTCAG 
19
  • You got bad advice/aren't using a version of cut with a -z option. Commented Apr 7, 2017 at 8:31
  • what is the contents of Necha2_scaffolds.fasta and what is expected output ? Commented Apr 7, 2017 at 8:34
  • 1
    awk -v start="${correctedstartDNA}" -v end="${correctedendDNA}" '{print substr($0,start,end)}' RS= Necha2_scaffolds.fasta Commented Apr 7, 2017 at 8:48
  • 1
    Please edit your question and show us a minimal example of all files. Most people will have no idea what a gff file looks like. Show us a small gff and a small fasta file and the output you would expect from them. Also explain whether you can have sequences in the - strand so we know whether our answers will need to deal with that as well. Oh and also explain that a fasta sequence is split into multiple lines, this is essential for your answer since when looking for a character range, we will need to concatenate the sequence lines. Commented Apr 7, 2017 at 9:04
  • 1
    As a general rule, it is almost never a good idea to use grep, sed and awk in a single pipeline, since sed can do everything grep can, and awk can do everything sed can. Commented Apr 7, 2017 at 9:13

0

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.