GenBank to FASTA format using regular expressions without Biopython

Question

This is a Biopython alternative with pretty straightforward code. How can I make this more concise?

def genbank_to_fasta(): file = input(r'Input the path to your file: ') with open(f'{file}') as f: gb = f.readlines() locus = re.search('NC_\d+\.\d+', gb[3]).group() region = re.search('(\d+)?\.+(\d+)', gb[2]) definition = re.search('\w.+', gb[1][10:]).group() definition = definition.replace(definition[-1], "") tag = locus + ":" + region.group(1) + "-" + region.group(2) + " " + definition sequence = "" for line in (gb): pattern = re.compile('[a,t,g,c]{10}') matches = pattern.finditer(line) for match in matches: sequence += match.group().upper() end_pattern = re.search('[a,t,g,c]{1,9}', gb[-3]) sequence += end_pattern.group().upper() print(len(sequence)) return sequence, tag

@Emma Those numbers are not always 8 digits long it depends. Here are the links for the reference I use. ncbi.nlm.nih.gov/nuccore/… (GenBank file) ncbi.nlm.nih.gov/nuccore/… (FASTA file). — Ethan Hetrick
– Ethan Hetrick, Commented Jun 26, 2020 at 2:38
@Emma that's good to know. My main problem came with the sequence. If the last group of DNA was not a group of 10, my current code will not parse it so I had to write the end_pattern pattern in order to get the last one. I think there is a better way to do it but I'm not sure. — Ethan Hetrick
– Ethan Hetrick, Commented Jun 26, 2020 at 2:53
What is the order of the size of the input files? Kilobytes? Gigabytes? — Peter Mortensen
– Peter Mortensen, Commented Jun 26, 2020 at 8:01

Reinderien · Accepted Answer · 2020-06-26 02:57:48Z

Line iteration

 gb = f.readlines() locus = re.search('NC_\d+\.\d+', gb[3]).group() region = re.search('(\d+)?\.+(\d+)', gb[2]) definition = re.search('\w.+', gb[1][10:]).group() for line in (gb): # ... end_pattern = re.search('[a,t,g,c]{1,9}', gb[-3])

can be

next(f) definition = re.search('\w.+', next(f)[10:]).group() region = re.search('(\d+)?\.+(\d+)', next(f)) locus = re.search('NC_\d+\.\d+', next(f)).group() gb = tuple(f) for line in gb:

Since you need gb[-3], you can't get away with a purely "streamed" iteration. You could be clever and keep a small queue of the last few entries you've read, if you're deeply worried about memory consumption.

Debugging statements

Remove this:

 print(len(sequence))

or convert it to a logging call at the debug level.

Compilation

Do not do this:

 pattern = re.compile('[a,t,g,c]{10}')

in an inner loop. The whole point of compile is that it can be done once to save the cost of re-compilation; so a safe option is to do this at the global level instead.

Stack Exchange Network

GenBank to FASTA format using regular expressions without Biopython

1 Answer 1

Line iteration

Debugging statements

Compilation

You must log in to answer this question.

Linked

Hot Network Questions

GenBank to FASTA format using regular expressions without Biopython

1 Answer 1

Line iteration

Debugging statements

Compilation

You must log in to answer this question.

Linked

Related

Hot Network Questions