4

I have a file with 7 columns, a GFF file having chromosomal regions.I want to collapse the rows where REGION ="exon" to only one row in the file.The row has to be collapsed on the basis of regions being overlapping with each other.

REGION START END SCORE STRAND FRAME ATTRIBUTE exon 26453 26644 . + . Transcript "XM_092971"; Name "XM_092971" exon 26842 27020 . + . Transcript "XM_092971"; Name "XM_092971" exon 30355 30899 . - . Transcript "XM_104663"; Name "XM_104663" GS_TRAN 30355 34083 . - . GS_TRAN "Hs22_30444_28_1_1"; Name "Hs22_30444_28_1_1" snp 30847 30847 . + . SNP "rs2971719"; Name "rs2971719" exon 31012 31409 . - . Transcript "XM_104663"; Name "XM_104663" exon 34013 34083 . - . Transcript "XM_104663"; Name "XM_104663" exon 40932 41071 . + . Transcript "XM_092971"; Name "XM_092971" snp 44269 44269 . + . SNP "rs2873227"; Name "rs2873227" snp 45723 45723 . + . SNP "rs2227095"; Name "rs2227095" exon 134031 134495 . - . Transcript "XM_086913"; Name "XM_086913" exon 134034 134457 . - . Transcript "XM_086914"; Name "XM_086914" 

Looking at the sample data above,only the last two rows can be merged into one row.So,the new row will become.

exon 134031 134495 . - . Transcript "XM_086913"; Name "XM_086913" 

In case,the end of the other row would have been greater than its previous,that would be the END region in that case.Basically,if there is any overlap,then take the region which starts Earlier,and the one which ends later.

There can be multiple rows of such instance,here only last 2 rows are there.One thing is that the ATRRIBUTE column will definitely show different Transcript names for such rows,which are mostly same in other cases.

Any suggestions on how to proceed.

UPDATED EXAMPLE: If the last 2 rows are like this

 exon 134031 134457 . - . Transcript "XM_086913"; Name "XM_086913" exon 134034 134495 . - . Transcript "XM_086914"; Name "XM_086914" 

Then the Output should be :

exon 134031 134495 . - . Transcript "XM_086913"; Transcript "XM_086914" 

Basically the START from first and END from second.Since I want to cover the overlap in one row only,instead of 2 or 3 or more rows.Here the overlap is between 2 rows, but could be between more than 2 rows as well.

UPDATED EXAMPLE (3/24/2014)

chr1 HAVANA stop_codon 1120520 1120522 . + 0 gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2"; chr1 HAVANA UTR 1115077 1115233 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2"; chr1 HAVANA UTR 1115414 1115433 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2"; chr1 HAVANA UTR 1120520 1121244 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2"; chr1 HAVANA transcript 1115864 1119307 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1"; chr1 HAVANA exon 1115864 1116240 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1"; chr1 HAVANA *exon 1117121 1117195* . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1"; chr1 HAVANA *exon 1117150 1117826* . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1"; chr1 HAVANA exon 1118256 1118427 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1"; chr1 HAVANA transcript 1190648 1209229 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2"; chr1 HAVANA exon 1209046 1209229 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2"; chr1 HAVANA exon 1203113 1203372 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2"; chr1 HAVANA CDS 1203241 1203372 . - 0 gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2"; chr1 HAVANA start_codon 1203370 1203372 . - 0 gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2"; chr1 HAVANA stop_codon 1203238 1203240 . - 0 gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2"; chr1 HAVANA exon 1198726 1198766 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2"; chr1 HAVANA exon 1192588 1192690 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2"; chr1 HAVANA exon 1192372 1192510 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2"; chr1 HAVANA *exon 1191425 1191505* . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2"; chr1 HAVANA *exon 1190648 1191470* . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2"; 

The upper portion shows the overlap in "+" strand,and the below portion shows the overlap in "-" strand.The "-" strand has decreasing regions,so overlap will be like shown in the last 2 rows.Both are different genes.So overlap has to be per gene,as sometimes different genes have overlapping exons also but this is very rare as I have read in some posts. The gene information can be extracted from the last column,present as"gene_name".

The two rows from gene_name=TTLL10 have overlapping exons,so they will be merged, in the final output.

chr1 HAVANA *exon 1117121 1117195* . + . transcript_id "ENST00000460998.1"; gene_name "TTLL10"; chr1 HAVANA *exon 1117150 1117826* . + . transcript_id "ENST00000460998.1"; gene_name "TTLL10"; 

The two rows from gene_name= UBE2J2 have overlap exons.

 chr1 HAVANA *exon 1191425 1191505* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2"; chr1 HAVANA *exon 1190648 1191470* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2"; 

SAMPLE OUTPUT

The rest of the rows remain same,and the above rows get merged for each gene.

chr1 HAVANA *exon 1117121 1117826* . + . transcript_id "ENST00000460998.1"; gene_name "TTLL10"; chr1 HAVANA *exon 1190648 1191505* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2"; 

In case,the transcript_ids are different,both transcript id's would be printed although gene_name will remain same.for e.g If for gene,the transcript id's were different like below:

 chr1 HAVANA *exon 1191425 1191505* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2"; chr1 HAVANA *exon 1190648 1191470* . - . transcript_id "ENST00000473215.2"; gene_name "UBE2J2"; 

This will merge to as above,but should have both transcript names.Since,I believe it might be needed and will be important later on to preserve the transcript information.

 chr1 HAVANA *exon 1190648 1191505* . - . transcript_id "ENST00000473215.1"; "ENST00000473215.2" gene_name "UBE2J2"; 
11
  • What about elements that overlap but on the other strand? What about nested genes? Also, the standard GFF format has \t separating the fields. This is not a normal GFF file and will not be read correctly by software expecting true GFF. In short, please clarify what you are attempting to do, this is more than a simple text parsing problem, there are biological considerations as well. Commented Mar 22, 2014 at 18:32
  • Thats a very valid point.I am trying to count the RNA sequencing reads that fall in exons,since those are counted twice in case the exons overlap, I want to make only one exon out of an overlap so that the read is counted once.I cut out the first two columns for a Sample GFF file here,But the other columns are from a GFF file.I did not want to complicate things for non-biological background people,so wanted to be as simple as I could.I can provide more information if required. Commented Mar 23, 2014 at 2:50
  • And can you be relatively sure you will never have things that overlap on the opposite strand? Can we safely ignore the strand information? Also, how big is this file? Is a solution that reads the whole thing into memory acceptable? Commented Mar 23, 2014 at 2:54
  • One more thing, is the first field in the real file a gene name perhaps or is it a chromosome or other large region? Meaning, will each set of overlapping exons have the same first field? Commented Mar 23, 2014 at 3:40
  • Yes,each set of overlapping exons should belong to the same gene.Since later on I would look at genes after counting reads in the overlapping eons. Commented Mar 24, 2014 at 3:01

2 Answers 2

6

An 'awk' approach,

awk ' $1!="exon" { # If the first died is unequal to "exon" if(previous)print previous # If there is a previous line then print it print # Print the current line previous=start=end=exon_seq="" # Set all variable to an empty string next # Move on to the next line in the input file } { if(exon_seq) { # if there is a sequence of lines with "exon in field 1 if(start<=$2 && end>=$3) # if the start value (field 2) of the previous line # is less or equal to the current line and the end # value of the previous line is greater than or # equal to field 3 of the current line next # then do nothing and read the next line else # if there is no overlap, print previous # then print the previous line } else { # if we are not already in the a sequence of # "exon" lines, then this is the first one exon_seq=1 # so exon_seq should become 1 } previous=$0; start=$2; end=$3 # `start` become field2, `end` becomes field 3 and # `previous` becomes the current record (line) } END{ # After all lines are processed if(previous) print previous # If there still is a previous line, then print it } ' file 
12
  • What are p,b,e,r,x in the script? Commented Mar 21, 2014 at 20:26
  • @Ron, you're right, changed the letters to a bit more mnemonic variable names.. Commented Mar 21, 2014 at 20:39
  • I ran the script but its producing the same input file as its output. Commented Mar 21, 2014 at 21:12
  • @Ron Is your file TAB separated? Otherwise, try leaving out -F'\t' In fact, it is not required here, I removed it from the script, could you try again? Commented Mar 21, 2014 at 21:22
  • Works after removing -F '\t'. Can you explain the code? Commented Mar 21, 2014 at 21:25
2

I would use Perl to solve such a complicated task. Here is a partial solution, you might need to tweak it to suit you better:

#!/usr/bin/perl use warnings; use strict; use List::Util qw{ max }; sub output { my $previous = shift; print join ' ', 'exon', @{$previous}{qw(start end score strand frame attribute)}; } $\ = "\n"; my %previous; while (<>) { chomp; my ($region, $start, $end, $score, $strand, $frame, $attribute) = split ' ', $_, 7; if ($. == 1) { print; } elsif ('exon' eq $region) { if (keys %previous and $start < $previous{end} # Overlap. ) { if ($end > $previous{end}) { # Not contained. $previous{attribute} =~ s/; Name .*//; $previous{attribute} .= '; ' . ($attribute =~ /(Transcript ".*?")/)[0]; $previous{end} = $end; } } else { if (keys %previous) { output(\%previous); } %previous = ( start => $start, end => $end, score => $score, strand => $strand, frame => $frame, attribute => $attribute, ); } } else { output(\%previous) if keys %previous; %previous = (); } } output(\%previous) if keys %previous; 
16
  • Can you please explain the code a little bit? Commented Mar 21, 2014 at 18:40
  • You keep the last line in %previous if it is an exon. If the current line is an exon, too, you compare their intervals and forget the previous one if it is part of the current one. Commented Mar 22, 2014 at 12:35
  • So it chooses the START whichever is earlier and END whichever is the later among "exon" rows?As the last 2 rows where the exons have to be merged could have interchanged ends,In that case START and END would be from different rows.Also Too many if-elses have made a bit too complicated to understand. Commented Mar 22, 2014 at 15:33
  • @Ron: I probably do not fully understand the specification. If the END came from the second row, where would the attribute come from? Commented Mar 22, 2014 at 16:04
  • so if you compare the starts from both rows,134031 starts earlier so I want that.And end is 134495 so,in this case this row should be the only row I want,but if the end would have been 134457 for this row and the other row had its end as 134495.Then Start of the first and end of the second row. Commented Mar 22, 2014 at 16:41

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.