I have a file of genomic data that is approximately 5 million lines long and should contain only the characters A, T, C, and G. The problem is that I know how large the file should be, and it's slightly larger than that. That means either something went wrong in an analysis, or there are lines that contain something other than genomic data.

Is there a way to find any line that has something other than an A, T, C, or G? Due to the nature of the file, any other letter, spaces, numbers, symbols shouldn't be present. I've gone through searching symbol by symbol, so I was hoping there would be an easier way.

  • Does it necessarily have to be in vi? Maybe grep -e "[^ATCG]" also works? Commented Aug 31, 2018 at 15:46
  • Is each line one column, or four columns with A, T, C and G in any order? Commented Sep 1, 2018 at 7:24

2 Answers

First of all, you definitely do not want to open the file in an editor (it's much too large to edit that way).

Instead, if you just want to identify whether the file contains anything other than A, T, C and G, you may do that with

grep '[^ATCG]' filename 

This would return all lines that contain anything other than those four characters.
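With 5 million lines, it probably helps to see where the offending lines are and how many there are, rather than just printing them. A minimal sketch on a tiny made-up file (the sample contents and the variable names are only for illustration), using the standard -n and -c options of grep:

```shell
# Build a small sample file (made-up data) to demonstrate the search.
sample=$(mktemp)
printf 'ATCG\nATNG\nGGCC\nAT CG\n' > "$sample"

# -n prefixes each offending line with its line number.
bad_lines=$(grep -n '[^ATCG]' "$sample")
printf '%s\n' "$bad_lines"

# -c prints only the number of offending lines.
bad_count=$(grep -c '[^ATCG]' "$sample")
printf '%s\n' "$bad_count"
```

Note that a carriage return left over from Windows-style line endings also counts as "something other than A, T, C or G", so such a file would have every line reported.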

If you want to delete these characters from the file, you may do so with

tr -c -d 'ATCG\n' <filename >newfilename 

(Whether this is the right way to "correct" the file, I don't know.)

This would remove all characters in the file that are not one of the four, and it would also retain newlines (\n). The edited file would be written to newfilename.
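Since you know how large the file should be, one way to sanity-check the result is to compare byte counts before and after, and to confirm that nothing unexpected is left. A small sketch, again on made-up data with illustrative variable names:

```shell
# Made-up input: the second line contains a space and an N.
original=$(mktemp)
cleaned=$(mktemp)
printf 'ATCG\nAT NG\nGGCC\n' > "$original"

# Delete every character that is not A, T, C, G or a newline.
tr -c -d 'ATCG\n' < "$original" > "$cleaned"

# How many bytes were removed?
removed=$(( $(wc -c < "$original") - $(wc -c < "$cleaned") ))
printf 'removed %s bytes\n' "$removed"

# Count lines that still contain an unexpected character.
# grep exits non-zero when nothing matches, hence the "|| true".
remaining=$(grep -c '[^ATCG]' "$cleaned" || true)
printf 'offending lines left: %s\n' "$remaining"
```

Bear in mind that this only deletes individual characters: a line such as "AT NG" becomes "ATG" rather than being dropped, and a line made up entirely of unwanted characters is left behind as an empty line.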

If it's a systematic error that has added something to the file, then this could possibly be corrected by sed or awk, but we don't yet know what your data looks like.


If you have the file open in vi or vim, then the command

/[^ATCG] 

will find the next character in the editing buffer that is not an A, T, C or G.

And :%s/[^ATCG]//g will remove them all.

  • The --line-number (-n) option to grep could be useful for obvious reasons. Commented Sep 1, 2018 at 0:48
  • Doesn't grep '[^ATCG]' just look for A, T, C & G in the first column? Commented Sep 1, 2018 at 7:23
  • @RonJohn No. The ^ has a totally different meaning when it occurs as the first character within [...] (it negates the character class). Commented Sep 1, 2018 at 7:31
  • I knew I should have added "not"... :) Thus, doesn't grep '[^ATCG]' look for any row which does not have A, T, C & G in the first column? Commented Sep 1, 2018 at 7:48
  • @RonJohn No. It looks for lines of text where at least one character is not A, T, C or G. The [...] means "one of these characters", and with the ^ at the start of that group, it means "a character, but not one of these". Commented Sep 1, 2018 at 8:04

I focused on the title:

Find any line in VI that has something other than ATCG

I tested this from vi's so-called "last line mode":

:%!tr -c -d 'ATCG\n'

The : enters command-line mode, % selects the whole file as the range, and ! filters that range through an external command, here tr -c -d 'ATCG\n', which happens to be the same one that @Kusalananda wrote :).
