Proteins are chains of amino acids. Each amino acid is coded by a codon, a sequence of 3 DNA/RNA nucleotides. A DNA strand also has 3 possible reading frames: the same sequence, read in triplets starting from the first, second, or third base. Thus, you will have 3 different translations (no skipping, skip the 1st base, skip the first 2 bases). Additionally, some sequencing techniques can only sequence a short length of DNA. Thus, you may need to sequence forwards and backwards (-f and -r in my code). Finally, these amino acid sequences start with a specific codon (the start codon) and end with one of a few specific stop codons.
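To make the reading-frame and reverse-complement ideas concrete, here is a minimal sketch using string slicing. It is not the script's own implementation; the function names and the 8-base input are made up for illustration, and the complement table mirrors the one used in the script further down.

```python
# Illustrative sketch only: names and example input are hypothetical.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}


def reading_frames(dna):
    """Return the three forward frames, each as a list of complete codons."""
    frames = []
    for offset in range(3):
        # Skip `offset` leading bases, then cut the rest into triplets.
        # Stopping at len(dna) - 2 drops any incomplete trailing codon.
        codons = [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]
        frames.append(codons)
    return frames


def reverse_complement(dna):
    """Reverse the sequence, then swap each base for its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


print(reading_frames("ATGTGCGC"))
print(reverse_complement("ATGTGCGC"))
```

The three frames of the reverse read are then just `reading_frames(reverse_complement(dna))`.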
This code takes the DNA and translates it to amino acid sequences, using the start and stop codons as borders. It offers the user 3 options: forward sequencing only, reverse sequencing only (where the DNA sequence needs to be reversed and then complemented), or a combination of both. If both are picked, the script looks for a point of intersection and combines the forward and reverse sequences at that intersection. Furthermore, it lets the user pick between all the potential sequences found. Finally, it uses BLAST to search the picked sequence against a database, to confirm the identity of the protein.
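The "combine at the point of intersection" step can be sketched roughly as below. This is a simplified stand-in, not the `overlay` function from the script: it uses `str.find` instead of a sliding window, and the function name, the `window` parameter, and the example sequences are assumptions made for illustration.

```python
def merge_at_overlap(forward, reverse, window=5):
    """Join two protein strings where the start of `reverse` occurs in `forward`.

    Returns None when no overlap of `window` residues is found.
    """
    seed = reverse[:window]
    if len(seed) < window:
        return None  # reverse sequence too short to anchor on
    position = forward.find(seed)
    if position == -1:
        return None  # the two sequences do not intersect
    # Keep the forward part before the overlap, then all of the reverse part.
    return forward[:position] + reverse


print(merge_at_overlap("MKTAYIAKQR", "IAKQRQISFV"))
```

Returning `None` on a missed overlap makes the failure explicit, so the caller can decide whether to fall back to a single read.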
A basic schematic:
    #DNA
    ATGTGCGC
    #translated
    1st reading frame: MC
    2nd reading frame: CA
    3rd reading frame: VR
    #since only 1st reading frame has seq that starts with M
    #sequence to search
    MC
    #Blast will search MC

That's the basic idea.
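Picking out candidate sequences from a translated frame can be sketched with a regular expression. This assumes the convention used in the script, where a stop codon is written as `X`; the function name and example strings are hypothetical, and the script itself uses explicit loops rather than `re`.

```python
import re

# A candidate runs from the first 'M' up to and including the next 'X' (stop).
ORF_PATTERN = re.compile(r"M[^X]*X")


def find_orfs(protein):
    """Return every M...X stretch in a translated frame, in order."""
    return ORF_PATTERN.findall(protein)


print(find_orfs("LMCAXQMKX"))
print(find_orfs("MAAA"))  # no stop marker, so nothing is reported
```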
I'm not very familiar with functions (it's why I have randomly assigned globals at the bottom; it's my "cheating" way of trying to make everything work). Additionally, this is my first time designing user inputs in the terminal and using those as "flags" (i.e. if the user types this in, do this). In its current state it's a little ugly (both main_loop and the reverse/forward loops depend on user input and contain multiple nested loops).
Thus I'm looking for 2 things:
1. A way to clean up some of the user input lines so I don't have this multiple nested main loop.
2. Feedback on the design/structure and use of my functions.
Is the code properly structured, and is it clean? Are the methodologies used "best practices"? In other words, are there better ways to do what I am attempting to do?
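As one possible direction for the repeated "print the rows, ask for a row number" logic, the prompt could live in a single function that returns the chosen sequence instead of appending to a global. This is only a sketch: `choose_sequence` and its `ask` parameter are hypothetical names, and the retry-on-bad-input behaviour is an addition that the current script does not have.

```python
def choose_sequence(candidates, ask=input):
    """Print numbered candidates and return the one the user picks.

    `ask` defaults to the built-in input(); passing another callable
    makes the function usable without a terminal.
    """
    if not candidates:
        raise ValueError("no candidate sequences to choose from")
    for number, sequence in enumerate(candidates):
        print(f"row {number} sequence: {sequence}")
    while True:
        answer = ask("indicate which row # sequence to search: ")
        # Accept only a non-negative integer that indexes an existing row.
        if answer.isdigit() and int(answer) < len(candidates):
            return candidates[int(answer)]
        print(f"please enter a number from 0 to {len(candidates) - 1}")
```

With something like this, the forward and reverse branches could each call `choose_sequence(find_open_reading_frames(codons))` and keep the result in a local variable.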
I'm writing this program with the aim of learning how to write longer/cleaner programs, learning how to design a program to work via the terminal (instead of a GUI), and as an excuse to learn selenium (although I do think it has some practical applications as well).
To run:

    python script.py -f forward_file.txt -r reverse_file.txt

The correct options to pick when presented with the translations are 1 and 0.
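For comparison, the same `-f` / `-r` command line could be handled with the standard library's `argparse` instead of indexing `sys.argv` by position. This is a sketch under assumptions: the long option names `--forward` and `--reverse` and the `parse_args` wrapper are not part of the script.

```python
import argparse


def parse_args(argv=None):
    """Parse -f/-r file options; argv=None means 'use sys.argv[1:]'."""
    parser = argparse.ArgumentParser(
        description="Translate sequencing reads and pick a protein to BLAST."
    )
    parser.add_argument("-f", "--forward", metavar="FILE",
                        help="file containing the forward read")
    parser.add_argument("-r", "--reverse", metavar="FILE",
                        help="file containing the reverse read")
    args = parser.parse_args(argv)
    if args.forward is None and args.reverse is None:
        # parser.error prints a usage message and exits with status 2.
        parser.error("give at least one of -f/--forward or -r/--reverse")
    return args


args = parse_args(["-f", "forward_file.txt", "-r", "reverse_file.txt"])
print(args.forward, args.reverse)
```

Because each option is looked up by name, the flags can be given in either order and a missing `-r` simply shows up as `args.reverse is None`.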
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import sys

    dna_codon_dict={'TTT':'F','TTC':'F',
                    'TTA':'L','TTG':'L',
                    'CTT':'L','CTC':'L',
                    'CTA':'L','CTG':'L',
                    'ATT':'I','ATC':'I',
                    'ATA':'I','ATG':'M',
                    'GTT':'V','GTC':'V',
                    'GTA':'V','GTG':'V',
                    'TCT':'S','TCC':'S',
                    'TCA':'S','TCG':'S',
                    'CCT':'P','CCC':'P',
                    'CCA':'P','CCG':'P',
                    'ACT':'T','ACC':'T',
                    'ACA':'T','ACG':'T',
                    'GCT':'A','GCC':'A',
                    'GCA':'A','GCG':'A',
                    'TAT':'Y','TAC':'Y',
                    'CAT':'H','CAC':'H',
                    'CAA':'Q','CAG':'Q',
                    'AAT':'N','AAC':'N',
                    'AAA':'K','AAG':'K',
                    'GAT':'D','GAC':'D',
                    'GAA':'E','GAG':'E',
                    'TGT':'C','TGC':'C',
                    'TGG':'W','CGT':'R',
                    'CGC':'R','CGA':'R',
                    'CGG':'R','AGT':'S',
                    'AGC':'S','AGA':'R',
                    'AGG':'R','GGT':'G',
                    'GGC':'G','GGA':'G',
                    'GGG':'G'}

    DNA_complement_dict={'A':'T',
                         'T':'A',
                         'G':'C',
                         'C':'G',
                         'N':'N'}

    def load_file(files):
        codon_list=[]
        with open(files) as seq_result:
            for lines in seq_result:
                if lines.startswith('>') is True:
                    continue
                remove_white_spaces=lines.strip().upper()
                for codon in remove_white_spaces:
                    codon_list.append(codon)
        return codon_list

    def rev(files):
        reverse_codon_list=[]
        codon_list=load_file(files)
        codon_list.reverse()
        for codons in codon_list:
            reversed_codon=DNA_complement_dict[codons]
            reverse_codon_list.append(reversed_codon)
        return reverse_codon_list

    def codon_translation(global_codon_list):
        codon_counter=0
        codon_triple_list=[]
        open_reading_frame_lists=[[],[],[],]
        for i in range(3):
            open_reading_frame_count=1
            codon_triple_list.clear()
            codon_counter=0
            for codons in global_codon_list:
                if open_reading_frame_count>=(i+1):
                    codon_counter+=1
                    codon_triple_list.append(codons)
                    if codon_counter == 3:
                        codon_counter=0
                        join_codons=''.join(codon_triple_list)
                        try:
                            amino_acid=dna_codon_dict[join_codons]
                            open_reading_frame_lists[i].append(amino_acid)
                        except:
                            pass
                        if join_codons in {'TAA','TAG','TGA'}:
                            open_reading_frame_lists[i].append('X')
                        codon_triple_list.clear()
                else:
                    open_reading_frame_count+=1
        return open_reading_frame_lists

    def find_open_reading_frames(global_codon_list):
        sequences_to_search=[]
        sequence_to_add_to_search_list=[]
        add_to_string=False
        for open_reading_frames in codon_translation(global_codon_list):
            for amino_acids in open_reading_frames:
                if amino_acids == 'M':
                    add_to_string=True
                if add_to_string is True:
                    sequence_to_add_to_search_list.append(amino_acids)
                if amino_acids == 'X':
                    add_to_string=False
                    if len(sequence_to_add_to_search_list)>0:
                        sequences_to_search.append(''.join(sequence_to_add_to_search_list))
                        sequence_to_add_to_search_list.clear()
                    else:
                        sequence_to_add_to_search_list.clear()
        return sequences_to_search

    def forward_loop():
        files=sys.argv[2]
        forward_flag=False
        if sys.argv[1] == '-f':
            forward_flag=True
        if forward_flag is True:
            codon_list=load_file(files)
            return codon_list

    def reverse_loop():
        if sys.argv[1] == '-f':
            revsere_flag=False
            try:
                if sys.argv[3] == '-r':
                    files=sys.argv[4]
                    reverse_flag=True
                if reverse_flag is True:
                    codon_list=rev(files)
                    return codon_list
            except:
                pass
        else:
            files=sys.argv[2]
            reverse_flag=False
            if sys.argv[1] == '-r':
                reverse_flag=True
            if reverse_flag is True:
                codon_list=rev(files)
                return codon_list

    def overlay(sequence_list1,sequence_list2):
        new_list1=[word for line in sequence_list1 for word in line]
        new_list2=[word for line in sequence_list2 for word in line]
        temp_list=[]
        modified_list1=[]
        counter=0
        for x in new_list1:
            temp_list.append(x)
            modified_list1.append(x)
            counter+=1
            if counter >= 5:
                if temp_list == new_list2[0:5]:
                    break
                else:
                    temp_list.pop((0))
        del new_list2[0:5]
        return ''.join(modified_list1+new_list2)

    sequence_list1=[]
    sequence_list2=[]
    global_codon_list=[]

    def main_loop():
        global global_codon_list
        global sequence_list1
        global sequence_list2
        if sys.argv[1] == '-f':
            global_codon_list=forward_loop()
            sequences_to_search=find_open_reading_frames(global_codon_list)
            sequence_to_search=[]
            for sequence,number in zip(sequences_to_search,range(len(sequences_to_search))):
                print(f'row {number} sequence: {sequence}')
                sequence_to_search.append(sequence)
            pick_sequence_to_search=input('indicate which row # sequence to search: ')
            sequence_list1.append(sequence_to_search[int(pick_sequence_to_search)])
            try:
                if sys.argv[3] == '-r':
                    global_codon_list=reverse_loop()
                    sequences_to_search=find_open_reading_frames(global_codon_list)
                    sequence_to_search=[]
                    for sequence,number in zip(sequences_to_search,range(len(sequences_to_search))):
                        print(f'row {number} sequence: {sequence}')
                        sequence_to_search.append(sequence)
                    pick_sequence_to_search=input('indicate which row # sequence to search: ')
                    sequence_list2.append(sequence_to_search[int(pick_sequence_to_search)])
            except:
                pass
        else:
            sequence_to_search=[]
            global_codon_list=reverse_loop()
            sequences_to_search=find_open_reading_frames(global_codon_list)
            for sequence,number in zip(sequences_to_search,range(len(sequences_to_search))):
                print(f'row {number} sequence: {sequence}')
                sequence_to_search.append(sequence)
            pick_sequence_to_search=input('indicate which row # sequence to search: ')
            sequence_list1.append(sequence_to_search[int(pick_sequence_to_search)])

    main_loop()

    driver = webdriver.Chrome()
    driver.get('https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome')
    fill_box = driver.find_element_by_xpath('/html/body/div[2]/div/div[2]/form/div[3]/fieldset/div[1]/div[1]/textarea')
    fill_box.clear()
    fill_box.send_keys(overlay(sequence_list1,sequence_list2))
    sumbit_button=driver.find_element_by_xpath('/html/body/div[2]/div/div[2]/form/div[6]/div/div[1]/div[1]/input')
    sumbit_button.click()

    #DNA forward
    >Delta_fl_pETDuet_1F
    NNNNNNNNNNNNNNNNANTTAATACGACTCACTATAGGGGAATTGTGAGCGGATAACAATTCCCCTCTAGAAATAATTTT
    GTTTAACTTTAAGAAGGAGATATACCATGGGCAGCAGCCATCACCATCATCACCACAGCCAGGATCCAATGATTCGGTTG
    TACCCGGAACAACTCCGCGCGCAGCTCAATGAAGGGCTGCGCGCGGCGTATCTTTTACTTGGTAACGATCCTCTGTTATT
    GCAGGAAAGCCAGGACGCTGTTCGTCAGGTAGCTGCGGCACAAGGATTCGAAGAACACCACACTTTTTCCATTGATCCCA
    ACACTGACTGGAATGCGATCTTTTCGTTATGCCAGGCTATGAGTCTGTTTGCCAGTCGACAAACGCTATTGCTGTTGTTA
    CCAGAAAACGGACCGAATGCGGCGATCAATGAGCAACTTCTCACACTCACCGGACTTCTGCATGACGACCTGCTGTTGAT
    CGTCCGCGGTAATAAATTAAGCAAAGCGCAAGAAAATGCCGCCTGGTTTACTGCGCTTGCGAATCGCAGCGTGCAGGTGA
    CCTGTCAGACACCGGAGCAGGCTCAGCTTCCCCGCTGGGTTGCTGCGCGCGCAAAACAGCTCAACTTAGAACTGGATGAC
    GCGGCAAATCAGGTGCTCTGCTACTGTTATGAAGGTAACCTGCTGGCGCTGGCTCAGGCACTGGAGCGTTTATCGCTGCT
    CTGGCCAGACGGCAAATTGACATTACCGCGCGTTGAACAGGCGGTGAATGATGCCGCGCATTTCACCCCTTTTCATTGGG
    TTGATGCTTTGTTGATGGGAAAAAGTAAGCGCGCATTGCATATTCTTCAGCAACTGCGTCTGGAAGGCAGCGAACCGGTT
    ATTTTGTTGCGCACATTAN

    #DNA Reverse
    >Delta_FL_pETDuet_R-T7-Term_B12.ab1
    NNNNNNNNNNNNNAGCTGCGCTAGTAGACGAGTCCATGTGCTGGCGTTCAAATTTCGCAGCAGCGGTTTCTTTACCAGAC
    TCGAGTTAACCGTCGATAAATACGTCCGCCAGGGGTTTATGGCACAACAGAAGAGATAACCCTTCCAGCTCTGCCCACAC
    TGACTGACCGTAATCTTGTTTGAGGGTGAGTTCCGTTCGTGTCAGGAGTTGCACGGCCTGACGTAACTGCGTCTGACTTA
    AGCGATTTAACGCCTCGCCCATCATGCCCCGGCGGTTCTGCCATACCCGATGCTTATCAAACAACGCACGCAGTGGCGTA
    TGGGCAGACTGGCGTTTCAGGTTAACCAGTAACAACAGTTCACGTTGTAATGTGCGCAACAAAATAACCGGTTCGCTGCC
    TTCCAGACGCAGTTGCTGAAGAATATGCAATGCGCGCTTACTTTTTCCCATCAACAAAGCATCAACCCAATGAAAAGGGG
    TGAAATGCGCGGCATCATTCACCGCCTGTTCAACGCGCGGTAATGTCAATTTGCCGTCTGGCCAGAGCAGCGATAAACGC
    TCCAGTGCCTGAGCCAGCGCCAGCAGGTTACCTTCATAACAGTAGCAGAGCACCTGATTTGCCGCGTCATCCAGTTCTAA
    GTTGAGCTGTTTTGCGCGCGCAGCAACCCAGCGGGGAAGCTGAGCCTGCTCCGGTGTCTGACAGGTCACCTGCACGCTGC
    GATTCGCAAGCGCAGTAAACCACGCGGCATTTTCTTGCGCTTTGCTTAATTTATTACCGCGGACGATCAACAGCNNNCGT
    CATGCAGAAGTCCGGTGAGTGTGAGAAGTTGCTCATNGATCGCCCGCATTCGGNCCGTTTTCTGGTANCANCAGNNATAC
    CGTTTGTCGANTGGCAAACANACN