extract word from file using information from a subset of the file (multiple steps)

Question

I have two large files that look like the following:

File 1:

NW_006502347.1 316684 NW_006527876.1 351 NW_006502151.1 27628 NW_006526579.1 232 NW_006525259.1 132 NW_006501641.1 437014 NW_006525259.1 378 NW_006523082.1 215 NW_006522424.1 153 NW_006522101.1 815 NW_006521985.1 505 NW_006521985.1 527 NW_006521722.1 920 NW_006521525.1 73 NW_006521432.1 258 NW_006521302.1 938 NW_006521272.1 585 NW_006521272.1 745 NW_006521038.1 202 NW_006519846.1 1528 NW_006519837.1 10215

File 2:

NW_006502347.1 Gnomon CDS 305319 305340 . + 0 ID=cds43608 Parent=rna48098 Dbxref=GeneID:102908761,Genbank:XP_006997436.2 Name=XP_006997436.2 gbkey=CDS gene=LOC102908761 partial=true product=histone deacetylase 4-like protein_id=XP_006997436.2 NW_006501037.1 Gnomon gene 6936 115174 . - . ID=gene0 Dbxref=GeneID:102922816 Name=Efl1 gbkey=Gene gene=Efl1 gene_biotype=protein_coding partial=true start_range=.,6936 NW_006501037.1 Gnomon mRNA 6936 115174 . - . ID=rna0 Parent=gene0 Dbxref=GeneID:102922816,Genbank:XM_006970114.2 Name=XM_006970114.2 gbkey=mRNA gene=Efl1 model_evidence=Supporting evidence includes similarity to: 5 mRNAs%2C 7 Proteins%2C and 99%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 16 samples with support for all annotated introns partial=true product=elongation factor like GTPase 1 start_range=.,6936 transcript_id=XM_006970114.2 NW_006501037.1 Gnomon exon 115095 115174 . - . ID=id1 Parent=rna0 Dbxref=GeneID:102922816,Genbank:XM_006970114.2 gbkey=mRNA gene=Efl1 partial=true product=elongation factor like GTPase 1 transcript_id=XM_006970114.2 NW_006501037.1 Gnomon exon 114246 114355 . - . ID=id2 Parent=rna0 Dbxref=GeneID:102922816,Genbank:XM_006970114.2 gbkey=mRNA gene=Efl1 partial=true product=elongation factor like GTPase 1 transcript_id=XM_006970114.2 NW_006502347.1 Gnomon mRNA 272091 477077 . + . ID=rna48098 Parent=gene26399 Dbxref=GeneID:102908761,Genbank:XM_006997374.2 Name=XM_006997374.2 end_range=477077,. gbkey=mRNA gene=LOC102908761 model_evidence=Supporting evidence includes similarity to: 1 mRNA%2C and 90%25 coverage of the annotated genomic feature by RNAseq alignments partial=true product=histone deacetylase 4-like transcript_id=XM_006997374.2 NW_006501037.1 Gnomon exon 92339 92472 . - . ID=id5 Parent=rna0 Dbxref=GeneID:102922816,Genbank:XM_006970114.2 gbkey=mRNA gene=Efl1 partial=true product=elongation factor like GTPase 1 transcript_id=XM_006970114.2 NW_006501037.1 Gnomon exon 90969 91106 . - . ID=id6 Parent=rna0 Dbxref=GeneID:102922816,Genbank:XM_006970114.2 gbkey=mRNA gene=Efl1 partial=true product=elongation factor like GTPase 1 transcript_id=XM_006970114.2 NW_006501037.1 Gnomon exon 89261 89475 . - . ID=id7 Parent=rna0 Dbxref=GeneID:102922816,Genbank:XM_006970114.2 gbkey=mRNA gene=Efl1 partial=true product=elongation factor like GTPase 1 transcript_id=XM_006970114.2 NW_006502151.1 Gnomon exon 26099 27657 . - . ID=id586652 Parent=rna47002 Dbxref=GeneID:102918658,Genbank:XM_006996663.2 gbkey=mRNA gene=Rftn1 product=raftlin%2C lipid raft linker 1 transcript_id=XM_006996663.2 NW_006501641.1 Gnomon mRNA 393496 438556 . + . ID=rna40001 Parent=gene21212 Dbxref=GeneID:102913870,Genbank:XM_015986269.1 Name=XM_015986269.1 gbkey=mRNA gene=LOC102913870 model_evidence=Supporting evidence includes similarity to: 9 mRNAs%2C 5 Proteins%2C and 81%25 coverage of the annotated genomic feature by RNAseq alignments product=transmembrane protein 189 transcript_id=XM_015986269.1 NW_006501053.1 Gnomon exon 5104713 5104872 . + . ID=id45206 Parent=rna3590 Dbxref=GeneID:102916769,Genbank:XR_001580019.1 gbkey=misc_RNA gene=Mycbpap product=MYCBP associated protein%2C transcript variant X4 transcript_id=XR_001580019.1 NW_006501053.1 Gnomon exon 5104959 5105062 . + . ID=id45207 Parent=rna3590 Dbxref=GeneID:102916769,Genbank:XR_001580019.1 gbkey=misc_RNA gene=Mycbpap product=MYCBP associated protein%2C transcript variant X4 transcript_id=XR_001580019.1 NW_006501053.1 Gnomon exon 5105698 5105881 . + . ID=id45208 Parent=rna3590 Dbxref=GeneID:102916769,Genbank:XR_001580019.1 gbkey=misc_RNA gene=Mycbpap product=MYCBP associated protein%2C transcript variant X4 transcript_id=XR_001580019.1 NW_006501053.1 Gnomon exon 5106131 5106246 . + . ID=id45209 Parent=rna3590 Dbxref=GeneID:102916769,Genbank:XR_001580019.1 gbkey=misc_RNA gene=Mycbpap product=MYCBP associated protein%2C transcript variant X4 transcript_id=XR_001580019.1

I want to use information from file 1 to extract the word following "gene=" under column 13 or 14 (e.g. "Efl1"). More specifically, I want to:

Step 1) Compare labels from column 1 from file 1 (e.g. NW_006527876.1) to labels of column 1 from file 2, extract all the rows for which column 1 (file 2) matches column 1 of file 1.

As you can see, the labels of column 1 (file 2) repeat, so there will be multiple matches for every label of file 1.

Column 4 and 5 of file 2 represent an interval, with column 4 being the start and column 5 the end of the interval. Column 2 of file 1 represent numbers in between these intervals.

Step 2) Of the rows isolated from step 1, extract the rows for which the number under column 2 (file 1) lies between the interval denoted by column 4 and 5 of file 2.

This is far beyond what I know but below is an idea of how the commands might look like.

awk '{ print $1 }' file 1 | awk `$4 *(file2)* < $2 *(file1)*' | awk '$5 *(file2)* > $2 *(file1)*' > output.tsv

The output should have rows with unique labels under column 1.

Step 3) From the output.tsv file created above, I'd like to extract the word that follows the equal sign in "gene=" under either column 13 or 14 (see below), such that I end up with a file just with the words following the equal sign.

Final output file (based this example):

LOC102908761 Rftn1 LOC102913870

ANSWER:

while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 { if (gensub(/.*gene=([A-Za-z0-9]*).*/, "\\1", 1) !~ /\s/) print gensub(/.*gene=([A-Za-z0-9]*).*/, "\\1", 1); }' < file2.txt; done <file1.txt > gene_hits.txt

Next to get rid of replicates (same locus in multiple intervals) while keeping different loci mapped to the same gene, do:

perl -ne 'print if ++$k{$_}==1' A_gene_hits.txt > A_genes.txt

Please post a minimal working example for the input files, as well as your expected output. With your current example, if I understand correctly, there are no matches, so there should be no output. — Sparhawk
– Sparhawk, Commented Mar 3, 2018 at 3:10
I've added matches to show what the output file would look like. — Age87
– Age87, Commented Mar 3, 2018 at 5:02
Hang on, so NW_006502347.1 matches twice. Wouldn't you expect two outputs for that? Or do you only want the first match? Also, please tag me with @Sparhawk so I am notified of replies. — Sparhawk
– Sparhawk, Commented Mar 3, 2018 at 7:26
@Sparhawk yes I only want one NW_006502347.1 because in the second number from list 1 (316684) falls within the interval 272091 477077, but not between 305319 305340. There will be many multiple hits based on the first column (NW_#) but the second number from list 1 should fall between the interval of only one of those hits. — Age87
– Age87, Commented Mar 3, 2018 at 20:02

Sparhawk · Accepted Answer · 2018-03-04 03:48:40Z

Assuming you have white space as a delimiter:

$ while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 { print gensub(/.*gene=([A-Za-z0-9]*).*/, "\\1", 1) }' <file2; done <file1 LOC102908761 Rftn1 LOC102913870

Explanation

while read -r id pos; do FOO; done <file1: this reads file1 line by line, and places the first field (e.g. NW_006502347.1) into the shell variable $id, and the second field (e.g. 316684) into the shell variable $pos. It then runs FOO for each line.
awk -v id="$id" -v pos="$pos" 'BAR' <file2: for each line of file1, we will run an awk command that will run BAR. This will search file2 for the matching parts. We need to tell this awk script the two "external" variables from the shell. i.e. the awk variable id is assigned the same value as the shell variable $id, and the same for the awk variable pos and the shell variable $pos.
$1 == id && pos > $4 && pos < $5: this is the "conditional" part of the awk script. If these conditions are met, then the following commands will run. Here, we are checking if the first field $1 of file2 is the same as the id from the current line of file1, AND if pos is between the 4th $4 and 5th $5 fields of file2.
{ print gensub(/.*gene=([A-Za-z0-9]*).*/, "\\1", 1) }: if the above conditions are met, then this code will run. We want to make a substitution with gensub first. This searches for gene= followed by an alphanumeric string of any length ([A-Za-z0-9]*). This alphanumeric string is ( captured ) by the parens. We will also "search" for all characters before and after .* the full string gene=([A-Za-z0-9]*). Hence, this "searches" for the entire line, which is replaced with the (first and only) capturing group "\\1", i.e. the alphanumeric string after gene=. The final 1 means to replace the first occurance, although this doesn't make much sense, as I assume there will only be one match of gene= per line.

Tab-delimited version

I prefer using tab-delimited files in general, especially for what I assume is a GFF/GTF file. This allows differentiating between spaces, especially in the 9th field.

while IFS=$'\t' read -r id pos; do awk -F'\t' -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 { print gensub(/.*gene=([A-Za-z0-9]*).*/, "\\1", 1) }' <file2.tsv ; done <file1.tsv

The modifications to the script are explicitly splitting shell lines on tabs with IFS=$'\t', and awk lines with -F'\t'.

Thanks, @Sparhawk!. The code you provided was printing the entire row for hits that matched the < and > condition. However, I wasn't clear enough in my questino, I only wanted the gene name as the output. My colleague modified your code to get it to work as I intended. — Age87
– Age87, Commented Mar 5, 2018 at 6:27
@Age87 Huh… it shouldn't have done that. I did understand the requirements. Also, I pasted the output in the first code block. How did you modify the code? I'd be interested as to what was changed. — Sparhawk
– Sparhawk, Commented Mar 5, 2018 at 6:29

Stack Exchange Network

extract word from file using information from a subset of the file (multiple steps)

1 Answer 1

Explanation

Tab-delimited version

You must log in to answer this question.

Hot Network Questions

extract word from file using information from a subset of the file (multiple steps)

1 Answer 1

Explanation

Tab-delimited version

You must log in to answer this question.

Related

Hot Network Questions