DNA Translator and Verifier (using BLAST)

Question

Proteins are chains of amino acids. Each amino acid is coded by a codon, a sequence of 3 DNA/RNA nucleotides. A DNA strand can also be read in 3 reading frames: the same sequence, read from a start position shifted by one (i.e. ignoring the first entry) or by two. Thus, you will have 3 different translations (no skipping, skipping the 1st entry, skipping the first 2 entries). Additionally, some sequencing techniques can only sequence short lengths of DNA, so you may need to sequence both forward and backward (-f and -r in my code). Finally, these amino acid sequences start with a specific codon and end with specific codons.

This code takes the DNA and translates it to an amino acid sequence, using the start and stop codons as borders. It offers the user 3 options: forward sequencing only, reverse sequencing only (where the DNA sequence needs to be reversed and then complemented), or a combination of both. If both are picked, the script looks for a point of intersection and combines the forward and reverse sequences at that intersection. Furthermore, it lets the user pick among all the potential sequences found. Finally, it uses BLAST to search the picked sequence against a database, to confirm the identity of the protein.

A basic schematic:

    #DNA
    AGTTGCGC
    #translated
    1st reading frame: MC
    2nd reading frame: VA
    3rd reading frame: LR
    #since only 1st reading frame has seq that starts with M
    #sequence to search
    MC
    #Blast will search MC

That's the basic idea.

I'm not very familiar with functions (it's why I have randomly assigned globals at the bottom; it's my "cheating" way of trying to make everything work). Additionally, this is also my first time trying to design user inputs in the terminal and using those as "flags" (i.e. if the user types this in, do this). In its current state it's a little ugly (in both the main_loop and the reverse/forward loops I have dependencies on user input and multiple nested loops).

Thus I'm looking for 2 things:

A way to clean up some of the user input lines so I don't have this multiply nested main loop.
Feedback on the design/structure and use of my functions: is the code properly structured and clean? Are the methodologies used "best practices"? In other words, are there better ways to do what I am attempting to do?

I'm writing this program with the aim of learning how to write longer/cleaner programs, learning how to design my program to work via the terminal (instead of a GUI), and as an excuse to learn Selenium as well (although I do think it has some practical applications too).

To run:

    python script.py -f forward_file.txt -r reverse_file.txt

The correct options to pick when presented with the translations are 1 and 0.

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import sys

    dna_codon_dict={'TTT':'F','TTC':'F',
                    'TTA':'L','TTG':'L',
                    'CTT':'L','CTC':'L',
                    'CTA':'L','CTG':'L',
                    'ATT':'I','ATC':'I',
                    'ATA':'I','ATG':'M',
                    'GTT':'V','GTC':'V',
                    'GTA':'V','GTG':'V',
                    'TCT':'S','TCC':'S',
                    'TCA':'S','TCG':'S',
                    'CCT':'P','CCC':'P',
                    'CCA':'P','CCG':'P',
                    'ACT':'T','ACC':'T',
                    'ACA':'T','ACG':'T',
                    'GCT':'A','GCC':'A',
                    'GCA':'A','GCG':'A',
                    'TAT':'Y','TAC':'Y',
                    'CAT':'H','CAC':'H',
                    'CAA':'Q','CAG':'Q',
                    'AAT':'N','AAC':'N',
                    'AAA':'K','AAG':'K',
                    'GAT':'D','GAC':'D',
                    'GAA':'E','GAG':'E',
                    'TGT':'C','TGC':'C',
                    'TGG':'W','CGT':'R',
                    'CGC':'R','CGA':'R',
                    'CGG':'R','AGT':'S',
                    'AGC':'S','AGA':'R',
                    'AGG':'R','GGT':'G',
                    'GGC':'G','GGA':'G',
                    'GGG':'G'}

    DNA_complement_dict={'A':'T',
                         'T':'A',
                         'G':'C',
                         'C':'G',
                         'N':'N'}

    def load_file(files):
        codon_list=[]
        with open(files) as seq_result:
            for lines in seq_result:
                if lines.startswith('>') is True:
                    continue
                remove_white_spaces=lines.strip().upper()
                for codon in remove_white_spaces:
                    codon_list.append(codon)
        return codon_list

    def rev(files):
        reverse_codon_list=[]
        codon_list=load_file(files)
        codon_list.reverse()
        for codons in codon_list:
            reversed_codon=DNA_complement_dict[codons]
            reverse_codon_list.append(reversed_codon)
        return reverse_codon_list

    def codon_translation(global_codon_list):
        codon_counter=0
        codon_triple_list=[]
        open_reading_frame_lists=[[],[],[],]
        for i in range(3):
            open_reading_frame_count=1
            codon_triple_list.clear()
            codon_counter=0
            for codons in global_codon_list:
                if open_reading_frame_count>=(i+1):
                    codon_counter+=1
                    codon_triple_list.append(codons)
                    if codon_counter == 3:
                        codon_counter=0
                        join_codons=''.join(codon_triple_list)
                        try:
                            amino_acid=dna_codon_dict[join_codons]
                            open_reading_frame_lists[i].append(amino_acid)
                        except:
                            pass
                        if join_codons in {'TAA','TAG','TGA'}:
                            open_reading_frame_lists[i].append('X')
                        codon_triple_list.clear()
                else:
                    open_reading_frame_count+=1
        return open_reading_frame_lists

    def find_open_reading_frames(global_codon_list):
        sequences_to_search=[]
        sequence_to_add_to_search_list=[]
        add_to_string=False
        for open_reading_frames in codon_translation(global_codon_list):
            for amino_acids in open_reading_frames:
                if amino_acids == 'M':
                    add_to_string=True
                if add_to_string is True:
                    sequence_to_add_to_search_list.append(amino_acids)
                if amino_acids == 'X':
                    add_to_string=False
                    if len(sequence_to_add_to_search_list)>0:
                        sequences_to_search.append(''.join(sequence_to_add_to_search_list))
                        sequence_to_add_to_search_list.clear()
                    else:
                        sequence_to_add_to_search_list.clear()
        return sequences_to_search

    def forward_loop():
        files=sys.argv[2]
        forward_flag=False
        if sys.argv[1] == '-f':
            forward_flag=True
        if forward_flag is True:
            codon_list=load_file(files)
            return codon_list

    def reverse_loop():
        if sys.argv[1] == '-f':
            revsere_flag=False
            try:
                if sys.argv[3] == '-r':
                    files=sys.argv[4]
                    reverse_flag=True
                if reverse_flag is True:
                    codon_list=rev(files)
                    return codon_list
            except:
                pass
        else:
            files=sys.argv[2]
            reverse_flag=False
            if sys.argv[1] == '-r':
                reverse_flag=True
            if reverse_flag is True:
                codon_list=rev(files)
                return codon_list

    def overlay(sequence_list1,sequence_list2):
        new_list1=[word for line in sequence_list1 for word in line]
        new_list2=[word for line in sequence_list2 for word in line]
        temp_list=[]
        modified_list1=[]
        counter=0
        for x in new_list1:
            temp_list.append(x)
            modified_list1.append(x)
            counter+=1
            if counter >= 5:
                if temp_list == new_list2[0:5]:
                    break
                else:
                    temp_list.pop((0))
        del new_list2[0:5]
        return ''.join(modified_list1+new_list2)

    sequence_list1=[]
    sequence_list2=[]
    global_codon_list=[]

    def main_loop():
        global global_codon_list
        global sequence_list1
        global sequence_list2
        if sys.argv[1] == '-f':
            global_codon_list=forward_loop()
            sequences_to_search=find_open_reading_frames(global_codon_list)
            sequence_to_search=[]
            for sequence,number in zip(sequences_to_search,range(len(sequences_to_search))):
                print(f'row {number} sequence: {sequence}')
                sequence_to_search.append(sequence)
            pick_sequence_to_search=input('indicate which row # sequence to search: ')
            sequence_list1.append(sequence_to_search[int(pick_sequence_to_search)])
            try:
                if sys.argv[3] == '-r':
                    global_codon_list=reverse_loop()
                    sequences_to_search=find_open_reading_frames(global_codon_list)
                    sequence_to_search=[]
                    for sequence,number in zip(sequences_to_search,range(len(sequences_to_search))):
                        print(f'row {number} sequence: {sequence}')
                        sequence_to_search.append(sequence)
                    pick_sequence_to_search=input('indicate which row # sequence to search: ')
                    sequence_list2.append(sequence_to_search[int(pick_sequence_to_search)])
            except:
                pass
        else:
            sequence_to_search=[]
            global_codon_list=reverse_loop()
            sequences_to_search=find_open_reading_frames(global_codon_list)
            for sequence,number in zip(sequences_to_search,range(len(sequences_to_search))):
                print(f'row {number} sequence: {sequence}')
                sequence_to_search.append(sequence)
            pick_sequence_to_search=input('indicate which row # sequence to search: ')
            sequence_list1.append(sequence_to_search[int(pick_sequence_to_search)])

    main_loop()

    driver = webdriver.Chrome()
    driver.get('https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome')
    fill_box = driver.find_element_by_xpath('/html/body/div[2]/div/div[2]/form/div[3]/fieldset/div[1]/div[1]/textarea')
    fill_box.clear()
    fill_box.send_keys(overlay(sequence_list1,sequence_list2))
    sumbit_button=driver.find_element_by_xpath('/html/body/div[2]/div/div[2]/form/div[6]/div/div[1]/div[1]/input')
    sumbit_button.click()

    #DNA forward
    >Delta_fl_pETDuet_1F
    NNNNNNNNNNNNNNNNANTTAATACGACTCACTATAGGGGAATTGTGAGCGGATAACAATTCCCCTCTAGAAATAATTTT
    GTTTAACTTTAAGAAGGAGATATACCATGGGCAGCAGCCATCACCATCATCACCACAGCCAGGATCCAATGATTCGGTTG
    TACCCGGAACAACTCCGCGCGCAGCTCAATGAAGGGCTGCGCGCGGCGTATCTTTTACTTGGTAACGATCCTCTGTTATT
    GCAGGAAAGCCAGGACGCTGTTCGTCAGGTAGCTGCGGCACAAGGATTCGAAGAACACCACACTTTTTCCATTGATCCCA
    ACACTGACTGGAATGCGATCTTTTCGTTATGCCAGGCTATGAGTCTGTTTGCCAGTCGACAAACGCTATTGCTGTTGTTA
    CCAGAAAACGGACCGAATGCGGCGATCAATGAGCAACTTCTCACACTCACCGGACTTCTGCATGACGACCTGCTGTTGAT
    CGTCCGCGGTAATAAATTAAGCAAAGCGCAAGAAAATGCCGCCTGGTTTACTGCGCTTGCGAATCGCAGCGTGCAGGTGA
    CCTGTCAGACACCGGAGCAGGCTCAGCTTCCCCGCTGGGTTGCTGCGCGCGCAAAACAGCTCAACTTAGAACTGGATGAC
    GCGGCAAATCAGGTGCTCTGCTACTGTTATGAAGGTAACCTGCTGGCGCTGGCTCAGGCACTGGAGCGTTTATCGCTGCT
    CTGGCCAGACGGCAAATTGACATTACCGCGCGTTGAACAGGCGGTGAATGATGCCGCGCATTTCACCCCTTTTCATTGGG
    TTGATGCTTTGTTGATGGGAAAAAGTAAGCGCGCATTGCATATTCTTCAGCAACTGCGTCTGGAAGGCAGCGAACCGGTT
    ATTTTGTTGCGCACATTAN

    #DNA Reverse
    >Delta_FL_pETDuet_R-T7-Term_B12.ab1
    NNNNNNNNNNNNNAGCTGCGCTAGTAGACGAGTCCATGTGCTGGCGTTCAAATTTCGCAGCAGCGGTTTCTTTACCAGAC
    TCGAGTTAACCGTCGATAAATACGTCCGCCAGGGGTTTATGGCACAACAGAAGAGATAACCCTTCCAGCTCTGCCCACAC
    TGACTGACCGTAATCTTGTTTGAGGGTGAGTTCCGTTCGTGTCAGGAGTTGCACGGCCTGACGTAACTGCGTCTGACTTA
    AGCGATTTAACGCCTCGCCCATCATGCCCCGGCGGTTCTGCCATACCCGATGCTTATCAAACAACGCACGCAGTGGCGTA
    TGGGCAGACTGGCGTTTCAGGTTAACCAGTAACAACAGTTCACGTTGTAATGTGCGCAACAAAATAACCGGTTCGCTGCC
    TTCCAGACGCAGTTGCTGAAGAATATGCAATGCGCGCTTACTTTTTCCCATCAACAAAGCATCAACCCAATGAAAAGGGG
    TGAAATGCGCGGCATCATTCACCGCCTGTTCAACGCGCGGTAATGTCAATTTGCCGTCTGGCCAGAGCAGCGATAAACGC
    TCCAGTGCCTGAGCCAGCGCCAGCAGGTTACCTTCATAACAGTAGCAGAGCACCTGATTTGCCGCGTCATCCAGTTCTAA
    GTTGAGCTGTTTTGCGCGCGCAGCAACCCAGCGGGGAAGCTGAGCCTGCTCCGGTGTCTGACAGGTCACCTGCACGCTGC
    GATTCGCAAGCGCAGTAAACCACGCGGCATTTTCTTGCGCTTTGCTTAATTTATTACCGCGGACGATCAACAGCNNNCGT
    CATGCAGAAGTCCGGTGAGTGTGAGAAGTTGCTCATNGATCGCCCGCATTCGGNCCGTTTTCTGGTANCANCAGNNATAC
    CGTTTGTCGANTGGCAAACANACN

I think this is roughly on-topic; but can you edit the question to include your concerns about the code and what you're looking for in a review? — Reinderien
– Reinderien, Commented Jul 3, 2020 at 18:00
Well, if it's biology that you are interested in, Python has lots of packages for chemistry, biology and physics. You may be interested in biopython.org. It's not related to the answer, but it may be useful. These libraries will already have those amino acid encodings and much more. I haven't used them, but that's what I'd expect. — Vishesh Mangla
– Vishesh Mangla, Commented Jul 5, 2020 at 7:39
@VisheshMangla seems like I reinvented the wheel here, almost everything my program does theirs does. But the above is less for practical use, and more for me to learn just how to use python. But thank you, lots of useful tools in that package. — samman
– samman, Commented Jul 8, 2020 at 0:41
That's great if it helped, I just randomly searched google for "python pip package biology". I don't know what's inside. — Vishesh Mangla
– Vishesh Mangla, Commented Jul 8, 2020 at 5:50

user226435 · Accepted Answer · 2020-07-04 20:46:56Z

    def load_file(files):
        codon_list=[]
        with open(files) as seq_result:
            for lines in seq_result:
                if lines.startswith('>') is True:
                    continue
                remove_white_spaces=lines.strip().upper()
                for codon in remove_white_spaces:
                    codon_list.append(codon)
        return codon_list

There is almost never a good reason to use is True; just remove it and your code will still work correctly.

We can remove remove_white_spaces by moving lines.strip().upper() into the for loop. This makes the code easier to read, as we no longer need to check whether remove_white_spaces is used again.

We can use a list comprehension to build codon_list instead. This is syntactic sugar that has increased the readability of lots of Python code.

You are using plurals incorrectly (files and lines). You can also use path instead of files and sequence instead of seq_result.

    def load_file(path):
        with open(path) as sequence:
            return [
                codon
                for line in sequence
                if not line.startswith('>')
                for codon in line.strip().upper()
            ]
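For readers new to comprehensions, the following sketch may help show that the clauses of the comprehension appear in the same order as the nested loops they replace. It works on an in-memory list of lines instead of a file; the function names and the sample data are made up for illustration and are not part of the original code.

```python
def parse_lines_loop(lines):
    """Explicit-loop version: collect every base from non-header lines."""
    bases = []
    for line in lines:
        if not line.startswith('>'):
            for base in line.strip().upper():
                bases.append(base)
    return bases


def parse_lines_comprehension(lines):
    """Same logic; the clauses appear in the same order as the loops above."""
    return [
        base
        for line in lines
        if not line.startswith('>')
        for base in line.strip().upper()
    ]


# Hypothetical FASTA-like input: one header line and two sequence lines.
sample = ['>header\n', 'atg\n', 'cc\n']
print(parse_lines_loop(sample))
print(parse_lines_comprehension(sample))
```

Both calls print the same list of single-letter bases, which is the point of the comparison.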

    def rev(files):
        reverse_codon_list=[]
        codon_list=load_file(files)
        codon_list.reverse()
        for codons in codon_list:
            reversed_codon=DNA_complement_dict[codons]
            reverse_codon_list.append(reversed_codon)
        return reverse_codon_list

Much like the previous function you can use a comprehension, and reversed_codon only impairs readability.

We can use the function reversed rather than list.reverse to reverse the list, to reduce line count and improve readability.

    def rev(files):
        return [
            DNA_complement_dict[codons]
            for codons in reversed(load_file(files))
        ]

    def codon_translation(global_codon_list):
        codon_counter=0
        codon_triple_list=[]
        open_reading_frame_lists=[[],[],[],]
        for i in range(3):
            open_reading_frame_count=1
            codon_triple_list.clear()
            codon_counter=0
            for codons in global_codon_list:
                if open_reading_frame_count>=(i+1):
                    codon_counter+=1
                    codon_triple_list.append(codons)
                    if codon_counter == 3:
                        codon_counter=0
                        join_codons=''.join(codon_triple_list)
                        try:
                            amino_acid=dna_codon_dict[join_codons]
                            open_reading_frame_lists[i].append(amino_acid)
                        except:
                            pass
                        if join_codons in {'TAA','TAG','TGA'}:
                            open_reading_frame_lists[i].append('X')
                        codon_triple_list.clear()
                else:
                    open_reading_frame_count+=1
        return open_reading_frame_lists

Your code is hard to read because your whitespace is not great and not consistent. Putting a space on either side of every operator will help readability.

You can use len(codon_triple_list) rather than codon_counter; this cuts out a significant amount of code, improving readability.

You shouldn't have bare excepts (except:); these catch too much and lead to problems. You should either use except KeyError: or make it so there is no exception.

You should have a second dictionary that also contains TAA, TAG and TGA.

You can invert open_reading_frame_count>=(i+1) to reduce the level of the arrow anti-pattern you have.

You have some really verbose names, making your code harder to read. Which is quicker to read: triples or codon_triple_list?

    def codon_translation(codons):
        reading_frames = ([], [], [])
        for i, reading_frame in enumerate(reading_frames):
            open_reading_frame_count = 1
            triples = []
            for codon in codons:
                if open_reading_frame_count <= i:
                    open_reading_frame_count += 1
                    continue
                triples += [codon]
                if len(triples) == 3:
                    reading_frame.append(dna_codon_dict2[''.join(triples)])
                    triples = []
        return reading_frames
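The rewrite above refers to dna_codon_dict2, which is not defined anywhere in this answer. One plausible reading, offered here as an assumption, is that it is the original table extended with the three stop codons mapped to 'X'. The sketch below uses a deliberately tiny table (not the full genetic code), and translate_triplet is a hypothetical helper showing how dict.get can avoid both the bare except: and a KeyError for triplets such as 'NNN'.

```python
# A tiny illustrative subset of the codon table, not the full genetic code.
dna_codon_dict = {'ATG': 'M', 'TGC': 'C', 'GTT': 'V'}

# Assumption: dna_codon_dict2 is the original table plus the stop codons,
# each mapped to the 'X' marker the question's code appends for a stop.
STOP_CODONS = {'TAA': 'X', 'TAG': 'X', 'TGA': 'X'}
dna_codon_dict2 = {**dna_codon_dict, **STOP_CODONS}


def translate_triplet(triplet):
    """Return the amino acid letter, 'X' for a stop codon, or None if unknown."""
    # dict.get returns None for a missing key instead of raising KeyError.
    return dna_codon_dict2.get(triplet)


print(translate_triplet('ATG'))
print(translate_triplet('TAA'))
print(translate_triplet('NNN'))
```

With a lookup like this, the caller can skip None results explicitly, so no exception handling is needed at all.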

You can remove the need for open_reading_frame_count by simply slicing codons by i.

You can build a windowed function to get triplets easily.

We can convert this into a nested comprehension.

    def windowed(values, size):
        return zip(*size*[iter(values)])


    def codon_translation(codons):
        return [
            [
                dna_codon_dict2[''.join(triplet)]
                for triplet in windowed(codons[i:], 3)
                if ''.join(triplet) in dna_codon_dict2
            ]
            for i in range(3)
        ]
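The windowed helper relies on a common idiom: size * [iter(values)] builds a list that holds the same iterator size times, so zip draws consecutive items from it, and trailing items that do not fill a whole group are dropped. A small demonstration with made-up input (the bases list is only for illustration):

```python
def windowed(values, size):
    # All `size` arguments passed to zip are the SAME iterator, so each
    # output tuple consumes `size` consecutive items from `values`.
    return zip(*size * [iter(values)])


bases = list('ATGTGCG')

# Slicing by the offset gives the three reading frames:
# frame 0 starts at index 0, frame 1 at index 1, frame 2 at index 2.
frames = [
    [''.join(triplet) for triplet in windowed(bases[offset:], 3)]
    for offset in range(3)
]

for offset, triplets in enumerate(frames):
    print(offset, triplets)
```

Because incomplete groups are silently discarded, no length check is needed before joining each triplet.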

Wow thank you! I just wanted to add a comment saying I've seen it and am going through it, it's just this is quite a bit (and I'm also unfamiliar with list comprehensions), so it's been taking me sometime to go through all the modifications/improvements. Thank you again! — samman
– samman, Commented Jul 4, 2020 at 20:17
@samman No problem! If you find anything particularly confusing just shout :) — user226435
– user226435, Commented Jul 4, 2020 at 20:18
There are some things that jump out at me. 1) What's wrong with plurals? I don't quite understand the change to the nomenclature. 2) Is there any advantage to redefining the list as empty (triples=[]) versus using .clear()? 3) I believe enumerate starts at 0, and we want our first value at 1. Can you just do ((reading_frames),1)? 4) The reason for try is because there will be values (that I cannot predict ahead of time) that will not be in my dict. However, I only want to append values that are in my dict. 5) Are nested list comprehensions more readable though? It might be that I'm still inadept — samman
– samman, Commented Jul 4, 2020 at 20:34
at reading list comprehensions, but I find the final box with the nested list comprehension to be significantly harder to read/follow than the one without it. — samman
– samman, Commented Jul 4, 2020 at 20:35
1) If you name something in its plural form, then you're saying there are 0 to infinitely many values. So you're saying "this is a list", however this is not the case. Saying something is something it's not is confusing. 2) Ignore that, I changed the code but didn't put it back fully. 3) Range starts at 0 as well, what's the problem? 4) This is confusing when you had the if afterwards. 5) With time, nested comprehensions like the one above are really easy to read. The underlying for loops may be harder to understand, but that'd be the same if it were not a comprehension either. — user226435
– user226435, Commented Jul 4, 2020 at 20:46

RootTwo · Accepted Answer · 2020-07-04 23:50:38Z

overall structure

I suggest splitting the program into two files. Everything before forward_loop() processes the files and could be split out into a separate library. This will make it easier to test the functions as well as reuse them in other scripts.
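As a sketch of what that split could enable: once the pure functions live in their own module, they can be checked without touching sys.argv, files, or a browser. The function reverse_complement below is a hypothetical, file-free variant of rev() written only for this illustration; it is not part of the original code.

```python
DNA_complement_dict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}


def reverse_complement(bases):
    """Hypothetical library function: reverse the bases, then complement each."""
    return [DNA_complement_dict[base] for base in reversed(bases)]


def test_short_sequence():
    assert reverse_complement(list('ATGC')) == list('GCAT')


def test_unknown_base_is_kept():
    assert reverse_complement(['N', 'A']) == ['T', 'N']


# A test runner such as pytest would collect these functions automatically;
# calling them directly also works.
test_short_sequence()
test_unknown_base_is_kept()
```

Keeping file reading out of the function under test is a design choice made for this sketch: the input can then be a plain list, which keeps the tests short.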

forward_loop() and reverse_loop() don't really seem necessary. Basically, the former calls load_file() and the latter calls rev(), which itself calls load_file().

It's not clear what's the purpose of overlay(). If it's a typical DNA processing function it should go in the library. If it's only needed to enter data in the web form, then it should go in the main script.

The rest of the code seems to deal with processing command-line args, getting user input and doing the search using selenium. It can go in the main script, which imports the library.

try argparse

Your code processes command line parameters in several places and multiple functions. Try using argparse from the standard library.

    import argparse

    def parse_args():
        parser = argparse.ArgumentParser()
        parser.add_argument('-f', '--forward', help="file for forward sequencing")
        parser.add_argument('-r', '--reverse', help="file for reverse sequencing")
        return parser.parse_args()

Calling it will return an object with attributes forward and reverse set to the argument or None.
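To make the returned object concrete, here is a minimal sketch that passes an explicit argument list instead of reading the real command line. The optional argv parameter is an addition made for this illustration (the parse_args() above takes no arguments), and the file name is just the one from the question's run command.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', '--forward', help="file for forward sequencing")
    parser.add_argument('-r', '--reverse', help="file for reverse sequencing")
    # argv=None makes argparse read sys.argv[1:]; an explicit list is
    # convenient for experimenting and for tests.
    return parser.parse_args(argv)


# Only -f is given, so args.reverse falls back to its default of None.
args = parse_args(['-f', 'forward_file.txt'])
print(args.forward)
print(args.reverse)
```

Checking `if args.forward:` and `if args.reverse:` then replaces every comparison against sys.argv positions.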

It looks like you intend to let the user pick multiple sequences for the search. That can be split into another function. Also, doc strings are good.

    def get_selection(sequences):
        """Lets the user select a subset of sequences from a list of sequences.

        Prints the sequences, one per row, with a row number and prompts the
        user to enter a space-separated list of row numbers.

        Returns a list of the selected sequences or an empty list.
        """
        print('row sequence')
        # Rows are numbered from 0 so they match the list indices used below.
        for number, sequence in enumerate(sequences):
            print(f'{number:3} {sequence}')

        print('To select sequences for the search, enter the '
              'row numbers separated by spaces, e.g., 0 2 3')
        picks = input(': ').strip()

        return [sequences[int(i)] for i in picks.split()] if picks else []


    def get_sequences(args):
        # Default to empty lists so both names exist when a flag is omitted.
        forward_sequences = []
        reverse_sequences = []

        if args.forward:
            codons = load_file(args.forward)
            sequences = find_open_reading_frames(codons)
            forward_sequences = get_selection(sequences)

        if args.reverse:
            # rev() takes a path and calls load_file() itself.
            codons = rev(args.reverse)
            sequences = find_open_reading_frames(codons)
            reverse_sequences = get_selection(sequences)

        return forward_sequences, reverse_sequences


    def main():
        args = parse_args()

        forward_sequences, reverse_sequences = get_sequences(args)

        driver = webdriver.Chrome()
        driver.get('https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome')
        fill_box = driver.find_element_by_xpath('/html/body/div[2]/div/div[2]/form/div[3]/fieldset/div[1]/div[1]/textarea')
        fill_box.clear()
        fill_box.send_keys(overlay(forward_sequences, reverse_sequences))
        submit_button = driver.find_element_by_xpath(
            '/html/body/div[2]/div/div[2]/form/div[6]/div/div[1]/div[1]/input'
        )
        submit_button.click()


    main()

I've run out of time, so this isn't tested. Hopefully you get the idea.
