6
\$\begingroup\$

This is a Biopython alternative with pretty straightforward code. How can I make this more concise?

def genbank_to_fasta(): file = input(r'Input the path to your file: ') with open(f'{file}') as f: gb = f.readlines() locus = re.search('NC_\d+\.\d+', gb[3]).group() region = re.search('(\d+)?\.+(\d+)', gb[2]) definition = re.search('\w.+', gb[1][10:]).group() definition = definition.replace(definition[-1], "") tag = locus + ":" + region.group(1) + "-" + region.group(2) + " " + definition sequence = "" for line in (gb): pattern = re.compile('[a,t,g,c]{10}') matches = pattern.finditer(line) for match in matches: sequence += match.group().upper() end_pattern = re.search('[a,t,g,c]{1,9}', gb[-3]) sequence += end_pattern.group().upper() print(len(sequence)) return sequence, tag 
\$\endgroup\$
5
  • 1
    \$\begingroup\$ @Emma here is an example. 'REGION: 26156329..26157115' \$\endgroup\$ Commented Jun 26, 2020 at 2:00
  • 1
    \$\begingroup\$ @Emma Those numbers are not always 8 digits long it depends. Here are the links for the reference I use. ncbi.nlm.nih.gov/nuccore/… (GenBank file) ncbi.nlm.nih.gov/nuccore/… (FASTA file). \$\endgroup\$ Commented Jun 26, 2020 at 2:38
  • 1
    \$\begingroup\$ @Emma that's good to know. My main problem came with the sequence. If the last group of DNA was not a group of 10, my current code will not parse it so I had to write the end_pattern pattern in order to get the last one. I think there is a better way to do it but I'm not sure. \$\endgroup\$ Commented Jun 26, 2020 at 2:53
  • \$\begingroup\$ What is the order of the size of the input files? Kilobytes? Gigabytes? \$\endgroup\$ Commented Jun 26, 2020 at 8:01
  • \$\begingroup\$ @PeterMortensen I believe it is in kilobytes \$\endgroup\$ Commented Jun 26, 2020 at 23:25

1 Answer 1

5
\$\begingroup\$

Line iteration

 gb = f.readlines() locus = re.search('NC_\d+\.\d+', gb[3]).group() region = re.search('(\d+)?\.+(\d+)', gb[2]) definition = re.search('\w.+', gb[1][10:]).group() for line in (gb): # ... end_pattern = re.search('[a,t,g,c]{1,9}', gb[-3]) 

can be

next(f) definition = re.search('\w.+', next(f)[10:]).group() region = re.search('(\d+)?\.+(\d+)', next(f)) locus = re.search('NC_\d+\.\d+', next(f)).group() gb = tuple(f) for line in gb: 

Since you need gb[-3], you can't get away with a purely "streamed" iteration. You could be clever and keep a small queue of the last few entries you've read, if you're deeply worried about memory consumption.

Debugging statements

Remove this:

 print(len(sequence)) 

or convert it to a logging call at the debug level.

Compilation

Do not do this:

 pattern = re.compile('[a,t,g,c]{10}') 

in an inner loop. The whole point of compile is that it can be done once to save the cost of re-compilation; so a safe option is to do this at the global level instead.

\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.