I am trying to find amino acid sequences that start with either L or A and then end in either L or A with two amino acids in between each instance of L or A.

This is what I have:

re.findall("A|L.{2}A|L", string1)

output:

['A', 'L', 'A', 'L', 'A', 'A', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'LNLA', 'A', 'L', 'L', 'L', 'L', 'A', 'L', 'A', 'L', 'L', 'L', 'A', 'A', 'A', 'A', 'LACA', 'A', 'L', 'L', 'L', 'A', 'A', 'A', 'A', 'A', 'A', 'L', 'LPYA', 'A', 'A', 'A', 'A', 'L', 'L', 'A', 'A', 'A']

I assume the extra L's and A's have something to do with the syntax, but I'm not sure what this | is exactly doing.

    It would help the question a lot if you provided sample input, with the desired output. Without seeing your input, I agree with @kay3 that it sounds like you want [AL].{2}[AL]. Commented Sep 26, 2022 at 0:47

1 Answer

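| means "or" (alternation). So "A|L" matches a single character, either "A" or "L".

Alternation has the lowest precedence in a regex, so "A|L.{2}A|L" splits into three alternatives: "A", or "L.{2}A", or "L". That is why most of your matches are single 'A's and 'L's.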
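To see how the alternatives split, here is a minimal sketch on a short made-up sequence (the `sample` string is an assumption for illustration, not the asker's data):

```python
import re

# Hypothetical short sequence, chosen only for illustration.
sample = "AALLLNLALPYAL"

# Alternation binds loosest, so this pattern reads as:
#   "A"  or  "L.{2}A"  or  "L"
loose = re.findall("A|L.{2}A|L", sample)
print(loose)  # ['A', 'A', 'L', 'L', 'LNLA', 'LPYA', 'L']
```

At each position the engine tries the alternatives from left to right, so a lone "A" or "L" is matched whenever "L.{2}A" does not fit there.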

Perhaps what you want is: "[AL].{2}[AL]"

Here [ ] defines a set of characters: re matches any single character listed inside the brackets. For example, "[AaLy]" matches 'A', 'a', 'L' or 'y'.
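One detail worth checking: inside a bracket set, | is an ordinary character rather than "or". A small sketch, using a made-up test string `probe`:

```python
import re

# Hypothetical test string containing a literal pipe character.
probe = "A|L"

# Inside [ ], the pipe is just another member of the set.
with_pipe = re.findall("[A|L]", probe)
print(with_pipe)  # ['A', '|', 'L']

# Without the pipe, only 'A' and 'L' are matched.
without_pipe = re.findall("[AL]", probe)
print(without_pipe)  # ['A', 'L']
```

For protein sequences the difference rarely shows up, since a pipe should not appear in the data, but "[AL]" states the intent more precisely.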

import re
string1 = "AALLLNLALPYAL"
# [AL] matches one 'A' or 'L'; .{2} matches any two characters
out0 = re.findall("[AL].{2}[AL]", string1)
print(out0)

Output

['AALL', 'LNLA', 'LPYA'] 

However, in "(A|L).{2}(A|L)" the parentheses are capturing groups. The pattern still matches A or L, then any two characters, then A or L again, but when a pattern contains groups, re.findall returns the captured groups as tuples instead of the whole match. So the output looks different:

out1 = re.findall("(A|L).{2}(A|L)", string1)
print(out1)

Output

[('A', 'L'), ('L', 'A'), ('L', 'A')] 

Well this may not be what you want, right?
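If you prefer the alternation form, one possible workaround (a sketch, not part of the answer above) is to use non-capturing groups, or to read the whole match from re.finditer. The string is the same sample used earlier in this answer:

```python
import re

string1 = "AALLLNLALPYAL"

# (?:...) groups without capturing, so findall returns whole matches.
non_capturing = re.findall("(?:A|L).{2}(?:A|L)", string1)
print(non_capturing)  # ['AALL', 'LNLA', 'LPYA']

# Alternatively, keep the capturing groups and take group(0),
# which is always the full matched text.
full_matches = [m.group(0) for m in re.finditer("(A|L).{2}(A|L)", string1)]
print(full_matches)  # ['AALL', 'LNLA', 'LPYA']
```

Both give the same result as "[AL].{2}[AL]"; the bracket set is simply the shortest way to write it.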
