
I am very new to unix!

I'm trying to figure out, from a fastq file how many reads have 3 or MORE As in a row.

I used egrep 'A{3}' to tell me how many AAA I have. But now I want to know >= 3 AAA in a row. But >= doesn't work. Can I use awk to help me determine this?

Also, how can I use a regular expression to determine: How many reads have a run of 4 or more As followed by something other than a T? (G C or A) So the run of As has to be >= 4 long, and followed by G, C, or A.

EDIT: When I say 3 As in a row, I mean something like this: GGCTAAAAAACGGAT

  • Welcome! Could you post a sample text and the desired output? Commented Mar 25, 2020 at 3:43
  • When I read "I want to know >= 3 AAA in a row" I thought you were trying to get a count of lines where AAA appears 3 or more times on a line, e.g. fooAAAbarAAAetcAAAetc, but all the answers so far interpret your question differently from me and, in some cases, from each other. Please edit your question to clarify your requirements and provide concise, testable sample input and expected output that adequately demonstrates those requirements. Commented Mar 25, 2020 at 13:51

3 Answers


If you want three or more as, you could use a{3,}. For example:

    $ echo a | grep -E 'a{3,}'
    $ echo aa | grep -E 'a{3,}'
    $ echo aaa | grep -E 'a{3,}'
    aaa
    $ echo aaaa | grep -E 'a{3,}'
    aaaa
    $ echo aaaaaaaaaa | grep -E 'a{3,}'
    aaaaaaaaaa

If you want 3 or more as followed by something that's not a t, you could use a{3,}[^t]. For example:

    $ echo aaa | grep -E 'a{3,}[^t]'
    $ echo aaat | grep -E 'a{3,}[^t]'
    $ echo aaax | grep -E 'a{3,}[^t]'
    aaax

Note, however, that an a isn't a t, so this also matches something like 'aaaa': the first three as are followed by a character that isn't a t (in this case, another a).

    $ echo aaaa | grep -E 'a{3,}[^t]'
    aaaa

If you want the run of as to be followed by something that is neither a nor t, you could use a{3,}[^ta]. For example:

    $ echo aaaa | grep -E 'a{3,}[^ta]'
    $ echo aaaaaaaa | grep -E 'a{3,}[^ta]'
    $ echo aaaaaaaattt | grep -E 'a{3,}[^ta]'
    $ echo aaaaaaaab | grep -E 'a{3,}[^ta]'
    aaaaaaaab
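
To get a number rather than the matching lines themselves, grep's -c option prints how many lines match. A minimal sketch, assuming uppercase bases as in the question; reads.txt and its contents are made up for illustration and hold one sequence per line (not a full fastq file):

```shell
# Hypothetical sample data: one sequence per line (illustration only).
printf '%s\n' GGCTAAAAAACGGAT GGAATTCC CAAAT TTAAAAG AAAAT > reads.txt

# Number of lines containing a run of three or more As.
n_runs=$(grep -cE 'A{3,}' reads.txt)
echo "$n_runs"      # prints 4

# Number of lines with four or more As followed by a character that is not T.
n_not_t=$(grep -cE 'A{4,}[^T]' reads.txt)
echo "$n_not_t"     # prints 2
```

The second count leaves out AAAAT (the run is followed by T) and CAAAT (the run is only three long).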

To print, for each line, the number of runs of three or more As, try

    awk '{print gsub (/AAAA*/, "&")}' file
    3
    4
    4
    1

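For your second request, adapt the above like this:

awk '{print gsub (/AAAAA*[CG]/, "&")}' file

"Followed by A" is already covered by the A* part of the pattern, which absorbs any further As in the run. Note that this only counts runs that end in C or G, so a run such as AAAAAT, or one at the very end of a line, is not counted.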

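Because gsub with "&" as the replacement puts each match back unchanged and returns the number of matches it replaced, the per-line numbers can also be summed, or used to count the lines that have at least one run. A rough sketch on made-up input (runs.txt and its contents are hypothetical):

```shell
# Hypothetical sample data: three lines, illustration only.
printf '%s\n' AAATTAAAA GGAATT AAAAC > runs.txt

# Total number of runs of three or more As across the whole file.
total=$(awk '{ n += gsub(/AAAA*/, "&") } END { print n + 0 }' runs.txt)
echo "$total"       # prints 3

# Number of lines that contain at least one such run.
lines=$(awk 'gsub(/AAAA*/, "&") > 0 { c++ } END { print c + 0 }' runs.txt)
echo "$lines"       # prints 2
```

The first line contributes two runs (AAA and AAAA), which is why the total is larger than the number of matching lines.
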
  • You can use /AAA+/ or /A{3,}/ instead of /AAAA*/. Commented Mar 25, 2020 at 13:45

from a fastq file how many reads have 3 or MORE As in a row

Since it's a fastq format file, you only want to look at the actual sequence lines, not all lines, to get an accurate count. Assuming each record is exactly four lines (no wrapped sequences), you can do this by using the NR variable to restrict matches to just the second line of each 4-line block:

awk 'NR%4 == 2 && /AAA/ { count++ } END { print count+0 }' foo.fastq 

How many reads have a run of 4 or more As followed by something other than a T? (G C or A)

awk 'NR%4 == 2 && /AAAA([^T]|$)/ { count++ } END { print count+0 }' foo.fastq 

(Note that this will match AAAAAT because it has 4 A's followed by another A)
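
To see why restricting to the sequence lines matters, here is a small sketch on a made-up three-read file. The name sample.fastq and its contents are invented for illustration; the quality line of the third read deliberately contains AAA, which is a legal run of quality characters:

```shell
# Hypothetical three-read fastq file (illustration only).
printf '%s\n' \
  '@read1' GGCTAAAAAACGGAT + IIIIIIIIIIIIIII \
  '@read2' CCAAAATGG + IIIIIIIII \
  '@read3' GGAATTAAA + AAAIIIIII > sample.fastq

# Naive count over all lines: also counts the quality line of read3.
naive=$(grep -c 'AAA' sample.fastq)

# Sequence lines only: reads with a run of three or more As.
three=$(awk 'NR%4 == 2 && /AAA/ { count++ } END { print count+0 }' sample.fastq)

# Sequence lines only: four As followed by a non-T character or end of line.
four=$(awk 'NR%4 == 2 && /AAAA([^T]|$)/ { count++ } END { print count+0 }' sample.fastq)

echo "$naive $three $four"   # prints 4 3 1
```

Only read1 passes the second test: read2 has exactly four As followed by T, and read3 has only three As.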
