EDIT and a solution
Because my original question was badly phrased and I was trying to re-invent the wheel I am answering my own question now (maybe it helps someone else):
gff2fasta is a tool which does exactly what I need, which is to extract a given piece of DNA from a full genome (A huge file in fasta format called FULLGENOME.fasta).
If I know where the piece I want is, I can make a file called TEST.gff where I specify the scaffold (here: sca_5_chr5_3_0) and the beginning (here: 2390621) and end (here: 2391041) of my piece for gff2fasta, for example:
sca_5_chr5_3_0 JGI CDS 2390621 2391041 . + 0 name "e_gw1.5.88.1"; proteinId 40463; Then I only need to run
gff2fasta.pl -gff TEST.gff -fasta FULLGENOME.fasta -out test gff2fasta will then get the info from TEST.gff and output the bit I want in the file called test, which will look like this:
>sca_5_chr5_3_0_2390621_2391041_F ATGACGACCCGCGCACCCAAAGACACATACGCTCAGCCCGACTATGAGGAGGCTCACCTAGCGACGTTTGCAGCCCCAAA AGGCTACCCTATCGAGTCTATGCTACCCCCTAGCGTGAAGAGGGAGACCTTTGAACAGGCCCTAGCCGAGTTTACCGACG CTATCGGCAAAGATTATGTCTTTATCGGCGATGCTCTCTCTCACTACATCGATCCCTACGACATCTATATCGATGATAGT GAGAGAAGGAGGATGCCGAGCGCGGCTGTTTGGTACGTAACGCGGCACCCATCGAAACCTAGCAGCACTGACAAGTTTCC GCGTCTAGTCCCTCTTCGCTCGAGGAAGCTAAGCAGGCTCTCAAGGTTGCGAACAAATACGGCATTCCGATCTGGGCATT TTCCAGAGGCAAGAACCTGGG Thanks to terdon and everyone else for the help!
-------------
This is the continuation of this question, with more info and details unix: get characters 10 to 80 in a file
I think I am almost there, but still need some help.
I have tried to explain it as clearly as possible, but I am aware that it is a very specialized question, so please let me know if I can clarify something further!
What I have are three files:
note to admin: is it possible to upload files? I have no clue how to...
One file (N_haematococca_1.fasta) from which I need to extract a name:
head -1 N_haematococca_1.fasta | cut -f4 -d'|' this name in this case is:
e_gw1.5.88.1 Problem 1: The code above works fine but I have troubles getting the name (e_gw1.5.88.1) saved in a variable that I can use for later...
I want to save that name in a variable, let's call it:
firstline A second file (Necha2_best_models.gff) where I want all the lines where this name occurs:
grep -ir "e_gw1.5.88.1" --include="Necha2*.gff" > Necha2_in_genome.list but with the named variable:
grep -ir $firstline --include="Necha2*.gff" > Necha2_in_genome.list This works for the use of "e_gw1.5.88.1"...
The file I am creating here tells me the name of the DNA-fragment I want to cut (sca_5_chr5_3_0) and which bit of it I need (from character number 2390621 to character number 2392655). I need that info to get the right bits out of the third file
a=(`awk -F '\t' '{print $4}' Necha2_in_genome.list`) startDNA=${a[1]} endDNA=${a[${#a[@]}-1]} #add 1000 or other number, depending on the problems with the gene correctedstartDNA=$(($startDNA-1000)) correctedendDNA=$(($endDNA-1000)) In a third file from which I want to cut certain parts after a keyword (sca_5_chr5_3_0) in this case:
Thanks to Kamaraj and hschou I have a partial solution to this now:
cat Necha2_scaffolds.fasta | sed -n -e '/sca_5_chr5_3_0/,$p' | grep -v '^>' | tr -d '\n'|awk -v start="${correctedstartDNA}" -v end="${correctedendDNA}" '{print substr($0,start,end)}' RS= Necha2_scaffolds.fasta However if I debug this with small numbers:
cat Necha2_scaffolds.fasta | sed -n -e '/sca_5_chr5_3_0/,$p' | grep -v '^>' | tr -d '\n'|awk -v start=10 -v end=20 '{print substr($0,start,end)}' RS= Necha2_scaffolds.fasta I get this output:
r1_1_0 CCTTATCCTAGCG nmapped CTTATATATTAT nmapped TAAAAGGAGTTA unmapped TCTTATATAAA unmapped AATCTTAAGAA It seem the RS option is ignored and it only prints the characters 10 to 20 for certain lines. I have no clue why these lines are selected.
sca_5_chr5_3_0 only occurs once in the file.
other names that are there are
>sca_66_unmapped >sca_67_unmapped etc
I have to get this info from 178 genomes, they are all huge files and searching manually is just not an option.
HOW the files look:
N_haematococca_1.fasta (file 1) is a normal fasta file:
>jgi|Necha2|40463|e_gw1.5.88.1 MTTRAPKDTYAQPDYEEAHLATFAAPKGYPIESMLPPSVKRETFEQALAEFTDAIGKDYVFIGDALSHYI DPYDIYIDDSERRRMPSAAVCPSSLEEAKQALKVANKYGIPIWAFSRGKNLGYGGPSARVNGSVAFDLHR Necha2_best_models.gff (file 2) looks like this (just longer):
sca_5_chr5_3_0 JGI exon 2390621 2390892 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2390621 2390892 . + 0 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 1 sca_5_chr5_3_0 JGI start_codon 2390621 2390623 . + 0 name "e_gw1.5.88.1" sca_5_chr5_3_0 JGI exon 2390949 2391041 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2390949 2391041 . + 2 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 2 sca_5_chr5_3_0 JGI exon 2391104 2391311 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2391104 2391311 . + 2 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 3 sca_5_chr5_3_0 JGI exon 2391380 2392367 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2391380 2392367 . + 0 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 4 sca_5_chr5_3_0 JGI exon 2392421 2392485 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2392421 2392485 . + 1 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 5 sca_5_chr5_3_0 JGI exon 2392541 2392657 . + . name "e_gw1.5.88.1"; transcriptId 40463 sca_5_chr5_3_0 JGI CDS 2392541 2392657 . + 0 name "e_gw1.5.88.1"; proteinId 40463; exonNumber 6 sca_5_chr5_3_0 JGI stop_codon 2392655 2392657 . + 0 name "e_gw1.5.88.1" sca_5_chr5_3_0 JGI exon 2396205 2396730 . + . name "e_gw1.5.227.1"; transcriptId 41333 sca_5_chr5_3_0 JGI CDS 2396205 2396730 . + 0 name "e_gw1.5.227.1"; proteinId 41333; exonNumber 1 sca_5_chr5_3_0 JGI start_codon 2396205 2396207 . + 0 name "e_gw1.5.227.1" Necha2_scaffolds.fasta (file 3) looks a bit like this (much longer GATC bits...):
>sca_8_chr1_1_0 CCTTATCCTAGCGAGGATTAAGGTCCTCGAAAAGAAAGGCAAAGATATAAAGGTAATATAATAGAAGATT AAGGTATTCTAAGTAAAGGTTATAAAGAAATAAAATAAGAAGAATATTTATAGGCTAAGAAAGACCCCCC TAAAGGTTAAGGACTTAATATTAAGATTTAATATTCCCTAATTAATTAATATATTAATAAAAATAAAGAT >sca_5_chr5_3_0 ATGACCACTATATCCATCGGCACAACGGCGTTAATCACATTTGGGTCTGCAATTTTCTGTTTTTGCGGTT TTCATGTCGGTTCCAACGGGCGGAGCTCGACCAAGAATTACATACATTACGTGGCGCGAGCGCTCTACCC TCTGGGACATGAAGGGGACGAGGAGTCTGGAGAGGTTCATGTTTGGAATCACAGCATGGTACTGCGGCAC CCTACCTTGGTCATTCGTTGGTCTCTTTTTTTCAGGGTATGGACGCAGACTCTCAGACTCGCTCTGCTTT >sca_67_unmapped CTACAGGAAGCCCTAGAGGCCCTGCAAGACCTTCCAAAGCAGCACTCTTTGCTTCTTCTTAGAGGCAGTA TCCAGCTTCTTCTAAGGCACCTCCAGAGGCAGCTAGACCCAACCGGGCTAAAAGACCTTTGGGAAGAGGC TAATACCCTTATAAGAGAGGCTATTATAGCCCTAGTGGCTAGAAGCCCTAGTGAGAGCCCTAAAGAGCCT AATTCAAGCCTTATAGCCCTTCCAGTCAGAGAGGGAGGCCTAGGAATACCCTTACACAAGGACCTAGCCC Expected final output: Is a single bit of text after the >sca_5_chr5_3_0 keyword
TTCATGTCGGTTCCAACGGGCGGAGCTCGACCAAGAATTACATACATTACGTGGCGCGAGCGCTCTACCC TCTGGGACATGAAGGGGACGAGGAGTCTGGAGAGGTTCATGTTTGGAATCACAGCATGGTACTGCGGCAC CCTACCTTGGTCATTCGTTGGTCTCTTTTTTTCAGGGTATGGACGCAGACTCTCAG
cutwith a-zoption.grep,sedandawkin a single pipeline, sincesedcan do everythinggrepcan, andawkcan do everythingsedcan.