1

I am trying to extract lines from file genome.gff that contain a line from file suspicious.txt. suspicious.txt was derived from genome.gff and every line should match.

Using grep on a single line from suspicious.txt works as expected:

grep 'gene10002' genome.gff NC_007082.3 Gnomon gene 1269632 1273520 . + . ID=gene10002;Dbxref=BEEBASE:GB54789,GeneID:409846;Name=bur;gbkey=Gene;gene=bur;gene_biotype=protein_coding NC_007082.3 Gnomon mRNA 1269632 1273520 . + . ID=rna21310;Parent=gene10002;Dbxref=GeneID:409846,Genbank:XM_393336.5,BEEBASE:GB54789;Name=XM_393336.5;gbkey=mRNA;gene=bur;product=burgundy;transcript_id=XM_393336.5 

But every variation on using grep from a file that I've been able to think of or find online produces no output or an empty file:

grep -f suspicious.txt genome.gff grep -F -f suspicious.txt genome.gff while read line; do grep "$line" genome.gff; done<suspicious.txt while read line; do grep '$line' genome.gff; done<suspicious.txt while read line; do grep "${line}" genome.gff; done<suspicious.txt cat suspicious.txt | while read line; do grep '$line' genome.gff; done cat suspicious.txt | while read line; do grep '$line' genome.gff >> suspicious.gff; done cat suspicious.txt | while read line; do grep -e "${line}" genome.gff >> suspicious.gff; done cat "$(cat suspicious_bee_geneIDs_test.txt)" | while read line; do grep -e "${line}" genome.gff >> suspicious.gff; done 

Running it as a script also produces an empty file:

#!/bin/bash SUSP=$1 GFF=$2 while read -r line; do grep -e "${line}" $GFF >> suspicious_bee_genes.gff done<$SUSP 

This is what the files look like:

head genome.gff ##gff-version 3 #!gff-spec-version 1.21 #!processor NCBI annotwriter #!genome-build Amel_4.5 #!genome-build-accession NCBI_Assembly:GCF_000002195.4 ##sequence-region NC_007070.3 1 29893408 ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7460 NC_007070.3 RefSeq region 1 29893408 . + . ID=id0;Dbxref=taxon:7460;Name=LG1;gbkey=Src;genome=chromosome;linkage- group=LG1;mol_type=genomic DNA;strain=DH4 NC_007070.3 Gnomon gene 181 211962 . - . ID=gene0;Dbxref=BEEBASE:GB42164,GeneID:726912;Name=cort;gbkey=Gene;gene=cort;gene_biotype=protein_coding NC_007070.3 Gnomon mRNA 181 71559 . - . ID=rna0;Parent=gene0;Dbxref=GeneID:726912,Genbank:XM_006557348.1,BEEBASE:GB42164;Name=XM_006557348.1;gbkey=mRNA;gene=cort;product=cortex%2C transcript variant X2;transcript_id=XM_006557348.1 wc -l genome.gff 457742 head suspicious.txt gene10002 gene1001 gene1003 gene10038 gene10048 gene10088 gene10132 gene10134 gene10181 gene10209 wc -l suspicious.txt 928 

Does anyone know what's going wrong here?

3
  • In you sample code, none of the suspicious.txt patterns are present in genome.gff. Adding a simple gene string in suspicious.txt produce the expected result for me with your script. Commented Mar 26, 2016 at 6:48
  • could you add ouput of these cmds? cat -v suspicious.txt | head ; ( tail suspicious.txt ; printf "//end\n" "";) Commented Mar 26, 2016 at 9:59
  • could you edit your script? insert a line after line 1: set -x . Then save and run again the script, add the output to the question Commented Mar 26, 2016 at 10:00

1 Answer 1

2

This can happen when the input file is in DOS format: each line will have a trailing CR character at the end, which will break the matching.

One way to check if this is the case is using hexdump, for example (just the first few lines):

$ hexdump -C suspicious.txt 00000000 67 65 6e 65 31 30 30 30 32 0d 0a 67 65 6e 65 31 |gene10002..gene1| 00000010 30 30 31 0d 0a 67 65 6e 65 31 30 30 33 0d 0a 67 |001..gene1003..g| 00000020 65 6e 65 31 30 30 33 38 0d 0a 67 65 6e 65 31 30 |ene10038..gene10| 

In the ASCII representation at the right, notice the .. after each gene. These dots correspond to 0d and 0a. The 0d is the CR character.

Without the CR character, the output should look like this:

$ hexdump -C <(tr -d '\r' < suspicious.txt) 00000000 67 65 6e 65 31 30 30 30 32 0a 67 65 6e 65 31 30 |gene10002.gene10| 00000010 30 31 0a 67 65 6e 65 31 30 30 33 0a 67 65 6e 65 |01.gene1003.gene| 00000020 31 30 30 33 38 0a 67 65 6e 65 31 30 30 34 38 0a |10038.gene10048.| 

Just one . after each gene, corresponding to 0a, and no 0d.

Another way to see the DOS line endings in the vi editor. If you open the file with vi, the status line would show [dos], or you could run the ex command :set ff? to make it tell you the file format (the status line will say fileformat=dos).

You can remove the CR characters on the fly like this:

grep -f <(tr -d '\r' < suspicious.txt) genome.gff 

Or you could remove in vi, by running the ex command :set ff=unix and then save the file. There are other command line tools too that can remove the DOS line ending.

Another possibility is that instead of a trailing CR character, you might have trailing whitespace. The output of hexdump -C should make that perfectly clear. After the trailing whitespace characters are removed, the grep -f should work as expected.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.