Solve the phase state between two haplotype blocks using markov transition probabilities

Question

I have spent about more than a year in python, but I come from biology background. I should say I have got some understanding of for-loop and nested-for-loop to workout the solution to the problem I have had. But, I suppose that there are better and comprehensive way of solving problem.

Below is the code I wrote to solve the haplotype phase state between two haplotype blocks.

Here is a short snipped of my data:

chr pos ms01e_PI ms01e_PG_al ms02g_PI ms02g_PG_al ms03g_PI ms03g_PG_al ms04h_PI ms04h_PG_al ms05h_PI ms05h_PG_al ms06h_PI ms06h_PG_al 2 15881764 4 C|T 6 C|T 7 T|T 7 T|T 7 C|T 7 C|T 2 15881767 4 C|C 6 T|C 7 C|C 7 C|C 7 T|C 7 C|C 2 15881989 4 C|C 6 A|C 7 C|C 7 C|C 7 A|T 7 A|C 2 15882091 4 G|T 6 G|T 7 T|A 7 A|A 7 A|T 7 A|C 2 15882451 4 C|T 4 T|C 7 T|T 7 T|T 7 C|T 7 C|A 2 15882454 4 C|T 4 T|C 7 T|T 7 T|T 7 C|T 7 C|T 2 15882493 4 C|T 4 T|C 7 T|T 7 T|T 7 C|T 7 C|T 2 15882505 4 A|T 4 T|A 7 T|T 7 T|T 7 A|C 7 A|T

Problem:

In the given data I want to solve the phase state of haplotype for the sample ms02g. The suffix PI represents the phased block (i.e all data within that block is phased).

For sample ms02g C-T-A-G is phased (PI level 6), so is T-C-C-T. Similarly data within PI level 4 is also phased. But, since the haplotype is broken in two levels, we don't know which phase from level-6 goes with which phase of level-4.

ms02g_PI ms02g_PG_al 6 C|T 6 T|C 6 A|C 6 G|T ×——————————×—————> Break Point 4 T|C 4 T|C 4 T|C 4 T|A

Since, all other samples are completely phased I can run a markov-chain transition probabilities to solve the phase state. To the human eye/mind you can clearly see and say that left part of ms02g (level 6, i.e C-T-A-G is more likely to go with right block of level 4 C-C-C-A).

Calculation:

Step 01:

I prepare required haplotype configuration:

The top haplotype-PI is block01 and bottom is block-02.
The left phased haplotype within each block is Hap-A and the right is Hap-B.

So, there are 2 haplotype configurations:

Block01-HapA with Block02-HapA, so B01-HapB with B02-HapB
Block01-HapA with Block02-HapB, so B01-HapB with B02-HapA

I count the number of transition for each nucleotide of PI-6 to each nucleotide of PI-4 for each haplotype configuration.

 Possible transitions ms02g_PG_al │ ┌┬┬┬ C|T │ ││││ T|C │ ││││ A|C │ ││││ G|T └─────> ││││ ×—————> Break Point │││└ T|C ││└─ T|C │└── T|C └─── T|A

and multiply the transition probabilities for first nucleotide in PI-6 to all nucleotides of PI-4. Then similarly multiply the transition probability for 2nd nucleotide of PI-6 to all nucleotides in PI-4.
then I sum the transition probabilities from one level to another.

 Possible transitions ms02g_PG_al │ ┌┬┬┬ C|T \ │ ││││ T|C | Block 1 │ ││││ A|C | │ ││││ G|T / └─────> ││││ ×—————> Break Point │││└ T|C \ ││└─ T|C | │└── T|C | Block 2 └─── T|A / ↓ ↓ Hap A Hap B

 Block-1-Hap-A with Block-2-Hap-B vs. Block-1-Hap-A with Block-2-Hap-A C|C × C|C × C|C × A|C = 0.58612 T|C × T|C × T|C × T|C = 0.000244 + C|T × C|T × C|T × A|T = 0.3164 + T|T × T|T × T|T × T|T = 0.00391 + C|A × C|A × C|A × A|A = 0.482 + T|A × T|A × T|A × T|A = 0.0007716 + C|G × C|G × C|G × A|G = 0.3164 + T|G × T|G × T|G × T|G = 0.00391 ——————— ————————— Avg = 0.42523 Avg = 0.002207 Likelyhood Ratio = 0.42523 / 0.002207 = 192.67

I have the following working code with python3: I think the transition probs can be mostly optimized using numpy, or using a markov-chain package. But, I was not able to work them out in the way I desired. Also, it took me sometime to get to what I wanted to do.

Any suggestions with explanation. Thanks.

#!/home/bin/python from __future__ import division import collections from itertools import product from itertools import islice import itertools import csv from pprint import pprint ''' function to count the number of transitions ''' def sum_Probs(pX_Y, pX): try: return float(pX_Y / pX) except ZeroDivisionError: return 0 with open("HaploBlock_Updated.txt", 'w') as f: ''' Write header (part 01): This is just a front part of the header. The rear part of the header is update dynamically depending upon the number of samples ''' f.write('\t'.join(['chr', 'pos', 'ms02g_PI', 'ms02g_PG_al', '\n'])) with open('HaploBlock_for_test-forHMMtoy02.txt') as Phased_Data: ''' Note: Create a dictionary to store required data. The Dict should contain information about two adjacent haplotype blocks that needs extending. In this example I want to extend the haplotypes for sample ms02g which has two blocks 6 and 4. So, I read the PI and PG value for this sample. Also, data should store with some unique keys. Some keys are: chr, pos, sample (PI, PG within sample), possible haplotypes ... etc .. ''' Phased_Dict = csv.DictReader(Phased_Data, delimiter='\t') grouped = itertools.groupby(Phased_Dict, key=lambda x: x['ms02g_PI']) ''' Function to read the data as blocks (based on PI values of ms02g) at a time''' def accumulate(data): acc = collections.OrderedDict() for d in data: for k, v in d.items(): acc.setdefault(k, []).append(v) return acc ''' Store data as keys,values ''' grouped_data = collections.OrderedDict() for k, g in grouped: grouped_data[k] = accumulate(g) ''' Clear memory ''' del Phased_Dict del grouped ''' Iterate over two Hap Blocks at once. This is done to obtain all possible Haplotype configurations between two blocks. The (keys,values) for first block is represented as k1,v2 and for the later block as k2,v2. ''' # Empty list for storing haplotypes from each block hap_Block1_A = []; hap_Block1_B = [] hap_Block2_A = []; hap_Block2_B = [] # Create empty list for possible haplotype configurations from above block hapB1A_hapB2A = [] hapB1B_hapB2B = [] ''' list of all available samples (index, value) ''' sample_list = [('ms01e_PI', 'ms01e_PG_al'), ('ms02g_PI', 'ms02g_PG_al'), ('ms03g_PI', 'ms03g_PG_al'), ('ms04h_PI', 'ms04h_PG_al'), ('ms05h_PI', 'ms05h_PG_al'), ('ms06h_PI', 'ms06h_PG_al')] ''' read data from two consecutive blocks at a time ''' for (k1, v1), (k2, v2) in zip(grouped_data.items(), islice(grouped_data.items(), 1, None)): ''' skip if one of the keys has no values''' if k1 == '.' or k2 == '.': continue ''' iterate over the first Haplotype Block, i.e the k1 block. The nucleotides in the left of the phased SNPs are called Block01-haplotype-A, and similarly on the right as Block01-haplotype-B. ''' hap_Block1_A = [x.split('|')[0] for x in v1['ms02g_PG_al']] # the left haplotype of Block01 hap_Block1_B = [x.split('|')[1] for x in v1['ms02g_PG_al']] ''' iterate over the second Haplotype Block, i.e the k2 block ''' hap_Block2_A = [x.split('|')[0] for x in v2['ms02g_PG_al']] hap_Block2_B = [x.split('|')[1] for x in v2['ms02g_PG_al']] ''' Now, we have to start to solve the possible haplotype configuration. Possible haplotype Configurations will be, Either : 1) Block01-haplotype-A phased with Block02-haplotype-A, creating -> hapB1A-hapB2A, hapB1B-hapB2B Or, 2) Block01-haplotype-A phased with Block02-haplotype-B creating -> hapB1A-hapB2B, hapB1B-hapB2A ''' ''' First possible configuration ''' hapB1A_hapB2A = [hap_Block1_A, hap_Block2_A] hapB1B_hapB2B = [hap_Block1_B, hap_Block2_B] ''' Second Possible Configuration ''' hapB1A_hapB2B = [hap_Block1_A, hap_Block2_B] hapB1B_hapB2A = [hap_Block1_B, hap_Block2_A] print('\nStarting MarkovChains') ''' Now, start preping the first order markov transition matrix for the observed haplotypes between two blocks.''' ''' To store the sum-values of the product of the transition probabilities. These sum are added as the product-of-transition is retured by nested for-loop; from "for m in range(....)" ''' Sum_Of_Product_of_Transition_probabilities_hapB1A_hapB2A = \ sumOf_PT_hapB1A_B2A = 0 Sum_Of_Product_of_Transition_probabilities_hapB1B_hapB2B = \ sumOf_PT_hapB1B_B2B = 0 Sum_Of_Product_of_Transition_probabilities_hapB1A_hapB2B = \ sumOf_PT_hapB1A_B2B = 0 Sum_Of_Product_of_Transition_probabilities_hapB1B_hapB2A = \ sumOf_PT_hapB1B_B2A = 0 for n in range(len(v1['ms02g_PI'])): # n-ranges from block01 ''' Set the probabilities of each nucleotides at Zero. They are updated for each level of "n" after reading the number of each nucleotides at that position. ''' pA = 0; pT = 0; pG = 0; pC = 0 # nucleotide_prob = [0., 0., 0., 0.] # or store as numpy matrix # Also storing these values as Dict # probably can be improved nucleotide_prob_dict = {'A': 0, 'T': 0, 'G': 0, 'C': 0} print('prob as Dict: ', nucleotide_prob_dict) ''' for storing the product-values of the transition probabilities. These are updated for each level of "n" paired with each level of "m". ''' product_of_transition_Probs_hapB1AB2A = POTP_hapB1AB2A = 1 product_of_transition_Probs_hapB1BB2B = POTP_hapB1BB2B = 1 product_of_transition_Probs_hapB1AB2B = POTP_hapB1AB2B = 1 product_of_transition_Probs_hapB1BB2A = POTP_hapB1BB2A = 1 ''' Now, we read each level of "m" to compute the transition from each level of "n" to each level of "m". ''' for m in range(len(v2['ms02g_PI'])): # m-ranges from block02 # transition for each level of 0-to-n level of V1 to 0-to-m level of V2 ''' set transition probabilities at Zero for each possible transition ''' pA_A = 0; pA_T = 0; pA_G = 0; pA_C = 0 pT_A = 0; pT_T = 0; pT_G = 0; pT_C = 0 pG_A = 0; pG_T = 0; pG_G = 0; pG_C = 0 pC_A = 0; pC_T = 0; pC_G = 0; pC_C = 0 ''' Also, creating an empty dictionary to store transition probabilities - probably upgrade using numpy ''' transition_prob_dict = {'A_A' : 0, 'A_T' : 0, 'A_G' : 0, 'A_C' : 0, 'T_A' : 0, 'T_T' : 0, 'T_G' : 0, 'T_C' : 0, 'G_A' : 0, 'G_T' : 0, 'G_G' : 0, 'G_C' : 0, 'C_A' : 0, 'C_T' : 0, 'C_G' : 0, 'C_C' : 0} ''' Now, loop through each sample to compute initial probs and transition probs ''' for x, y in sample_list: print('sample x and y:', x,y) ''' Update nucleotide probability for this site - only calculated from v1 and only once for each parse/iteration ''' if m == 0: pA += (v1[y][n].count('A')) pT += (v1[y][n].count('T')) pG += (v1[y][n].count('G')) pC += (v1[y][n].count('C')) nucleotide_prob_dict['A'] = pA nucleotide_prob_dict['T'] = pT nucleotide_prob_dict['G'] = pG nucleotide_prob_dict['C'] = pC nucleotide_prob = [pA, pT, pG, pC] ''' Now, update transition matrix ''' nucl_B1 = (v1[y][n]).split('|') # nucleotides at Block01 nucl_B2 = (v2[y][m]).split('|') # nucleotides at Block02 ''' create possible haplotype configurations between "n" and "m". If the index (PI value) are same we create zip, if index (PI value) are different we create product. ''' HapConfig = [] # create empty list if v1[x][n] == v2[x][m]: ziped_nuclB1B2 = [(x + '_' + y) for (x,y) in zip(nucl_B1, nucl_B2)] HapConfig = ziped_nuclB1B2 ''' Move this counting function else where ''' pA_A += (HapConfig.count('A_A')) # (v1[y][0].count('A'))/8 pA_T += (HapConfig.count('A_T')) pA_G += (HapConfig.count('A_G')) pA_C += (HapConfig.count('A_C')) pT_A += (HapConfig.count('T_A')) # (v1[y][0].count('A'))/8 pT_T += (HapConfig.count('T_T')) pT_G += (HapConfig.count('T_G')) pT_C += (HapConfig.count('T_C')) pG_A += (HapConfig.count('G_A')) # (v1[y][0].count('A'))/8 pG_T += (HapConfig.count('G_T')) pG_G += (HapConfig.count('G_G')) pG_C += (HapConfig.count('G_C')) pC_A += (HapConfig.count('C_A')) pC_T += (HapConfig.count('C_T')) pC_G += (HapConfig.count('C_G')) pC_C += (HapConfig.count('C_C')) if v1[x][n] != v2[x][m]: prod_nuclB1B2 = [(x + '_' + y) for (x,y) in product(nucl_B1, nucl_B2)] HapConfig = prod_nuclB1B2 print('prod NuclB1B2: ', prod_nuclB1B2) pA_A += (HapConfig.count('A_A'))/2 pA_T += (HapConfig.count('A_T'))/2 pA_G += (HapConfig.count('A_G'))/2 pA_C += (HapConfig.count('A_C'))/2 pT_A += (HapConfig.count('T_A'))/2 pT_T += (HapConfig.count('T_T'))/2 pT_G += (HapConfig.count('T_G'))/2 pT_C += (HapConfig.count('T_C'))/2 pG_A += (HapConfig.count('G_A'))/2 pG_T += (HapConfig.count('G_T'))/2 pG_G += (HapConfig.count('G_G'))/2 pG_C += (HapConfig.count('G_C'))/2 pC_A += (HapConfig.count('C_A'))/2 pC_T += (HapConfig.count('C_T'))/2 pC_G += (HapConfig.count('C_G'))/2 pC_C += (HapConfig.count('C_C'))/2 ''' Now, compute nucleotide and transition probabilities for each nucleotide from each 0-n to 0-m at each sample. This updates the transition matrix in each loop. **Note: At the end this transition probabilities should sum to 1 ''' ''' Storing nucleotide probabilities ''' nucleotide_prob = [pA, pT, pG, pC] ''' Storing transition probability as dict''' transition_prob_dict['A_A'] = sum_Probs(pA_A,pA) transition_prob_dict['A_T'] = sum_Probs(pA_T,pA) transition_prob_dict['A_G'] = sum_Probs(pA_G,pA) transition_prob_dict['A_C'] = sum_Probs(pA_C,pA) transition_prob_dict['T_A'] = sum_Probs(pT_A,pT) transition_prob_dict['T_T'] = sum_Probs(pT_T,pT) transition_prob_dict['T_G'] = sum_Probs(pT_G,pT) transition_prob_dict['T_C'] = sum_Probs(pT_C,pT) transition_prob_dict['G_A'] = sum_Probs(pG_A,pG) transition_prob_dict['G_T'] = sum_Probs(pG_T,pG) transition_prob_dict['G_G'] = sum_Probs(pG_G,pG) transition_prob_dict['G_C'] = sum_Probs(pG_C,pG) transition_prob_dict['C_A'] = sum_Probs(pC_A,pC) transition_prob_dict['C_T'] = sum_Probs(pC_T,pC) transition_prob_dict['C_G'] = sum_Probs(pC_G,pC) transition_prob_dict['C_C'] = sum_Probs(pC_C,pC) '''Now, we start solving the haplotype configurations after we have the required data (nucleotide and transition probabilities). Calculate joint probability for each haplotype configuration. Step 01: We calculate the product of the transition from one level of "n" to several levels of "m". ''' ''' finding possible configuration between "n" and "m". ''' hapB1A_hapB2A_transition = hapB1A_hapB2A[0][n] + '_' + hapB1A_hapB2A[1][m] hapB1B_hapB2B_transition = hapB1B_hapB2B[0][n] + '_' + hapB1B_hapB2B[1][m] hapB1A_hapB2B_transition = hapB1A_hapB2B[0][n] + '_' + hapB1A_hapB2B[1][m] hapB1B_hapB2A_transition = hapB1B_hapB2A[0][n] + '_' + hapB1B_hapB2A[1][m] ''' computing the products of transition probabilities on the for loop ''' POTP_hapB1AB2A *= transition_prob_dict[hapB1A_hapB2A_transition] POTP_hapB1BB2B *= transition_prob_dict[hapB1B_hapB2B_transition] POTP_hapB1AB2B *= transition_prob_dict[hapB1A_hapB2B_transition] POTP_hapB1BB2A *= transition_prob_dict[hapB1B_hapB2A_transition] ''' Step 02: sum of the product of the transition probabilities ''' sumOf_PT_hapB1A_B2A += POTP_hapB1AB2A sumOf_PT_hapB1B_B2B += POTP_hapB1BB2B sumOf_PT_hapB1A_B2B += POTP_hapB1AB2B sumOf_PT_hapB1B_B2A += POTP_hapB1BB2A ''' Step 03: Now, computing the likely hood of each haplotype configuration ''' print('computing likelyhood:') likelyhood_hapB1A_with_B2A_Vs_B2B = LH_hapB1AwB2AvsB2B = \ sumOf_PT_hapB1A_B2A / sumOf_PT_hapB1A_B2B likelyhood_hapB1B_with_B2B_Vs_B2A = LH_hapB1BwB2BvsB2A = \ sumOf_PT_hapB1B_B2B / sumOf_PT_hapB1B_B2A likelyhood_hapB1A_with_B2B_Vs_B2A = LH_hapB1AwB2BvsB2A = \ sumOf_PT_hapB1A_B2B / sumOf_PT_hapB1A_B2A likelyhood_hapB1B_with_B2A_Vs_B2B = LH_hapB1BwB2AvsB2B = \ sumOf_PT_hapB1B_B2A / sumOf_PT_hapB1B_B2B print('\nlikely hood Values: B1A with B2A vs. B2B, B1B with B2B vs. B2A') print(LH_hapB1AwB2AvsB2B, LH_hapB1BwB2BvsB2A) print('\nlikely hood Values: B1A with B2B vs. B2A, B1B with B2A vs. B2B') print(LH_hapB1AwB2BvsB2A, LH_hapB1BwB2AvsB2B) ''' Update the phase state of the block based on evidence ''' if (LH_hapB1AwB2AvsB2B + LH_hapB1BwB2BvsB2A) > \ 4*(LH_hapB1AwB2BvsB2A + LH_hapB1BwB2AvsB2B): print('Block-1_A is phased with Block-2_A') for x in range(len(v1['ms02g_PI'])): PI_value = v1['ms02g_PI'][x] # Note: so, we trasfer the PI value from ealier block to next block f.write('\t'.join([v1['chr'][x], v1['pos'][x], v1['ms02g_PI'][x], hapB1A_hapB2A[0][x] + '|' + hapB1B_hapB2B[0][x], '\n'])) for x in range(len(v2['ms02g_PI'])): f.write('\t'.join([v2['chr'][x], v2['pos'][x], PI_value, hapB1A_hapB2A[1][x] + '|' + hapB1B_hapB2B[1][x], '\n'])) elif (LH_hapB1AwB2AvsB2B + LH_hapB1BwB2BvsB2A) < \ 4 * (LH_hapB1AwB2BvsB2A + LH_hapB1BwB2AvsB2B): print('Block-1_A is phased with Block-2_B') for x in range(len(v1['ms02g_PI'])): PI_value = v1['ms02g_PI'][x] # Note: so, we trasfer the PI value from ealier block to next block... # but, the phase position are reversed f.write('\t'.join([v1['chr'][x], v1['pos'][x], v1['ms02g_PI'][x], hapB1A_hapB2A[0][x] + '|' + hapB1B_hapB2B[0][x], '\n'])) for x in range(len(v2['ms02g_PI'])): f.write('\t'.join([v2['chr'][x], v2['pos'][x], PI_value, hapB1B_hapB2B[1][x] + '|' + hapB1A_hapB2A[1][x], '\n'])) else: print('cannot solve the phase state with confidence - sorry ') for x in range(len(v1['ms02g_PI'])): f.write('\t'.join([v1['chr'][x], v1['pos'][x], v1['ms02g_PI'][x], hapB1A_hapB2A[0][x] + '|' + hapB1B_hapB2B[0][x], '\n'])) for x in range(len(v2['ms02g_PI'])): f.write('\t'.join([v2['chr'][x], v2['pos'][x], v2['ms02g_PI'][x], hapB1A_hapB2A[1][x] + '|' + hapB1B_hapB2B[1][x], '\n']))

Nice question! Thorough explanations and I like the handwritten diagrams. I hope you get some good answers. — Gareth Rees
– Gareth Rees, Commented Jan 30, 2018 at 23:17
I tried to improve readability of the table by adding spaces and converted some images into text. Feel free to revert if you think this does more harm than good. — 301_Moved_Permanently
– 301_Moved_Permanently, Commented Jan 31, 2018 at 9:26
Also in the last image, you wrote "Block-1-Hap-A with Block-2-Hap-B" vs. "Block-1-Hap-A with Block-2-Hap-A" but as I read the data below, it's the other way round ("2B with 1A" and "2A with 1A"). Does it matter in any way? — 301_Moved_Permanently
– 301_Moved_Permanently, Commented Jan 31, 2018 at 9:33
@mathias: that's true. I just happened to flip the position of haplotypes, when trying to test so three conditions. Also, thanks a lot for improving the format of the question. — everestial
– everestial, Commented Jan 31, 2018 at 17:50
I plan to answer this; there is some good stuff here and also some low-hanging fruit to improve. I would be interested in a longer dataset to run this on. This holds especially given you have the performance tag, which is generally very dependent on the actual data. — Veedrac
– Veedrac, Commented Feb 4, 2018 at 2:09

Veedrac · Accepted Answer · 2018-02-12 03:37:27Z

Before we get to the interesting stuff, we should handle some stylistic niggles. Note that PEP 8 is a de-facto style for Python code. First, imports should be sorted

import collections import csv import itertools from itertools import islice from itertools import product from pprint import pprint

You only use collections for OrderedDict, so import that directly.

Names should be either snake_case, for functions and variables, or UpperCamelCase for classes. This means

sum_probs → sum_Probs pX_Y → px_y pX → px Phased_Data → phased_data Phased_Dict → phased_dict hap_Block1_A → hap_block1_a hap_Block1_B → hap_block1_b hap_Block2_A → hap_block2_a hap_Block2_B → hap_block2_b hapB1A_hapB2A → hapb1a_hapb2a hapB1B_hapB2B → hapb1b_hapb2b hapB1A_hapB2B → hapb1a_hapb2b hapB1B_hapB2A → hapb1b_hapb2a Sum_Of_Product_of_Transition_probabilities_hapB1A_hapB2A → sum_of_product_of_transition_probabilities_hapb1a_hapb2a sumOf_PT_hapB1A_B2A → sum_of_pt_hapb1a_b2a Sum_Of_Product_of_Transition_probabilities_hapB1B_hapB2B → sum_of_product_of_transition_probabilities_hapb1b_hapb2b sumOf_PT_hapB1B_B2B → sum_of_pt_hapb1b_b2b Sum_Of_Product_of_Transition_probabilities_hapB1A_hapB2B → sum_of_product_of_transition_probabilities_hapb1a_hapb2b sumOf_PT_hapB1A_B2B → sum_of_pt_hapb1a_b2b Sum_Of_Product_of_Transition_probabilities_hapB1B_hapB2A → sum_of_product_of_transition_probabilities_hapb1b_hapb2a sumOf_PT_hapB1B_B2A → sum_of_pt_hapb1b_b2a pA → pa pT → pt pG → pg pC → pc product_of_transition_Probs_hapB1AB2A → product_of_transition_probs_hapb1ab2a POTP_hapB1AB2A → potp_hapb1ab2a product_of_transition_Probs_hapB1BB2B → product_of_transition_probs_hapb1bb2b POTP_hapB1BB2B → potp_hapb1bb2b product_of_transition_Probs_hapB1AB2B → product_of_transition_probs_hapb1ab2b POTP_hapB1AB2B → potp_hapb1ab2b product_of_transition_Probs_hapB1BB2A → product_of_transition_probs_hapb1bb2a POTP_hapB1BB2A → potp_hapb1bb2a pA_A → pa_a pA_T → pa_t pA_G → pa_g pA_C → pa_c pT_A → pt_a pT_T → pt_t pT_G → pt_g pT_C → pt_c pG_A → pg_a pG_T → pg_t pG_G → pg_g pG_C → pg_c pC_A → pc_a pC_T → pc_t pC_G → pc_g pC_C → pc_c nucl_B1 → nucl_b1 nucl_B2 → nucl_b2 HapConfig → hap_config likelyhood_hapB1A_with_B2A_Vs_B2B → likelyhood_hapb1a_with_b2a_vs_b2b LH_hapB1AwB2AvsB2B → lh_hapb1awb2avsb2b likelyhood_hapB1B_with_B2B_Vs_B2A → likelyhood_hapb1b_with_b2b_vs_b2a LH_hapB1BwB2BvsB2A → lh_hapb1bwb2bvsb2a likelyhood_hapB1A_with_B2B_Vs_B2A → likelyhood_hapb1a_with_b2b_vs_b2a LH_hapB1AwB2BvsB2A → lh_hapb1awb2bvsb2a likelyhood_hapB1B_with_B2A_Vs_B2B → likelyhood_hapb1b_with_b2a_vs_b2b LH_hapB1BwB2AvsB2B → lh_hapb1bwb2avsb2b

Some notes I will get back to later:

Consistency is important. Don't write hapB1A_hapB2A in one place, hapB1A_B2A in another and hapB1AB2A in another.
Don't use variables as comments where comments would work better.
There is a lot of structure here, like p[ATGC]_[ATGC] that suggests working at a higher level would save a lot of effort.

But enough of this for now.

First function, sum_probs.

''' function to count the number of transitions ''' def sum_probs(px_y, px): try: return float(px_y / px) except ZeroDivisionError: return 0

To be frank, this makes no sense.

You very sensibly added from __future__ import division, so the float case is useless.
You catch ZeroDivisionError and return 0, which is not explained or warned about.
Your docstring is misplaced and does not match PEP 257 guidelines.
Your docstring says you "count the number of transitions", but you do no counting, rather a single division.

Let's fix the easy issues and move it to neighbor the only place it's used so we can revisit it at a more appropriate time.

def sum_probs(px_y, px): """Return the number of transitions.""" try: return px_y / px except ZeroDivisionError: return 0 ''' Storing transition probability as dict''' transition_prob_dict['A_A'] = sum_probs(pa_a,pa) ...

We then open with a file open

with open("HaploBlock_Updated.txt", 'w') as f: ''' Write header (part 01): This is just a front part of the header. The rear part of the header is update dynamically depending upon the number of samples ''' f.write('\t'.join(['chr', 'pos', 'ms02g_PI', 'ms02g_PG_al', '\n']))

First of all, +1 for using with. However,

Almost all Python code should be encapsulated inside functions, and doing complex logic in the global scope is highly disadvised.
You don't use f yet, so it is best to only open it later. Eventually we'd like it to only wrap the part that actually cares about it.
Using strings as multiline comments is not idiomatic and rather disadvised.

You then repeat this with

with open('HaploBlock_for_test-forHMMtoy02.txt') as phased_data: ''' Note: Create a dictionary to store required data. The Dict should contain information about two adjacent haplotype blocks that needs extending. In this example I want to extend the haplotypes for sample ms02g which has two blocks 6 and 4. So, I read the PI and PG value for this sample. Also, data should store with some unique keys. Some keys are: chr, pos, sample (PI, PG within sample), possible haplotypes ... etc .. ''' // lots more logic

The file should be closed as soon as you are finished with it, and the comment should be reformatted.

# Create a dictionary to store required data. # # The Dict should contain information about two adjacent # haplotype blocks that needs extending. In this example # I want to extend the haplotypes for sample ms02g, which # has two blocks, 6 and 4, so I read the PI and PG value # for this sample. # # Also, data should be stored with unique keys. # Some keys are: chr, pos, sample (PI, PG within sample), # possible haplotypes, etc. with open('HaploBlock_for_test-forHMMtoy02.txt') as phased_data: // logic that uses phased_data // lots more logic

I am not going to mention using strings as comments again, but you do need to fix this everywhere. Remember that this does not apply to docstrings, where you should follow PEP 257.

Inside the now-first block, we have an accumulate function

''' Function to read the data as blocks (based on PI values of ms02g) at a time''' def accumulate(data): acc = OrderedDict() for d in data: for k, v in d.items(): acc.setdefault(k, []).append(v) return acc

Let's quickly work on that docstring. First of all, note that it's in the wrong place and not in the right voice.

def accumulate(data): """Read the data as blocks (based on PI values of ms02g) at a time.""" ...

Secondly, the function is generic. It does not read data. It does not care about ms02g. IT just groups data in a certain manner. That is what it does, so that is what it should say it does. Its name should reflect this.

def merge_ordereddicts(odicts): """Merge ordered dictionaries, mapping each key to lists of all values mapped to in order of occurence.""" ...

I do not believe there is any reason for these dictionaries to actually be ordered, so I would abandon that. I would also prefer short names to single letter variables.

def merge_dicts(dicts): """Merge dictionaries, mapping each key to lists of all values mapped to in order of occurence.""" merged = {} for dict_ in dicts: for key, val in dict_.items(): merged.setdefault(key, []).append(val) return merged

You then write

''' Store data as keys,values ''' grouped_data = OrderedDict() for k, g in grouped: grouped_data[k] = merge_dicts(g)

which would be nicer as a comprehension

grouped_data = OrderedDict((k, merge_dicts(g)) for k, g in grouped)

You finished off with

del phased_dict del grouped

An experienced Python programmer would probably be very averse to those lines; deleting variables should be done very sparingly. But in this context it's actually good; you have global data and you're clearing the mess up after yourself. The reason experienced programmers don't do this very much is because they know to encapsulate logic like this instead. The end result looks something like

def merge_dicts(dicts): """Merge dictionaries, mapping each key to lists of all values mapped to in order of occurence.""" merged = {} for dict_ in dicts: for key, val in dict_.items(): merged.setdefault(key, []).append(val) return merged def parse_phased_data(file): """Parse input into OrderedDict keyed by ms02g_PI. The result contains information about two adjacent haplotype blocks that needs extending. Data should be stored with unique keys. Some keys are: chr, pos, sample (PI, PG within sample), possible haplotypes, etc.""" phased_dict = csv.DictReader(phased_data, delimiter='\t') grouped = itertools.groupby(phased_dict, key=lambda x: x['ms02g_PI']) return OrderedDict((k, merge_dicts(g)) for k, g in grouped) with open('HaploBlock_for_test-forHMMtoy02.txt') as phased_data: grouped_data = parse_phased_data(phased_data)

I don't entirely understand your docstrings, so I may have mangled them a little. The final with and assignment itself will eventually be moved into another function.

Next,

hap_block1_a = []; hap_block1_b = [] hap_block2_a = []; hap_block2_b = [] # Create empty list for possible haplotype configurations from above block hapb1a_hapb2a = [] hapb1b_hapb2b = []

What;s with those semicolons? More importantly, these assigned values are never used since they always get assigned ahead of time, so they can only hide bugs. Remove them entirely!

You then

# list of all available samples (index, value) sample_list = [('ms01e_PI', 'ms01e_PG_al'), ('ms02g_PI', 'ms02g_PG_al'), ('ms03g_PI', 'ms03g_PG_al'), ('ms04h_PI', 'ms04h_PG_al'), ('ms05h_PI', 'ms05h_PG_al'), ('ms06h_PI', 'ms06h_PG_al')]

but only use this way down on what is for me now line 146,

for x, y in sample_list: ...

It is best to just move it there, though you should use a tuple instead because, being immutable, those can be built once and then reused.

# Loop through each sample to compute initial probs and transition probs. sample_list = (('ms01e_PI', 'ms01e_PG_al'), ('ms02g_PI', 'ms02g_PG_al'), ('ms03g_PI', 'ms03g_PG_al'), ('ms04h_PI', 'ms04h_PG_al'), ('ms05h_PI', 'ms05h_PG_al'), ('ms06h_PI', 'ms06h_PG_al')) for x, y in sample_list: ...

Then you go straight to the loop,

# Iterate over two Hap Blocks at once. This is done to obtain all possible Haplotype # configurations between two blocks. The (keys,values) for first block is represented as # k1,v2 and for the later block as k2,v2. for (k1, v1), (k2, v2) in zip(grouped_data.items(), islice(grouped_data.items(), 1, None)): # Skip if one of the keys has no values if k1 == '.' or k2 == '.': continue

Because all the shared variables have been removed, it is now obvious that each iteration is stateless except for the file being written into. This means we can move the entire contents of the loop into a function, though it is cleaner to keep the first if outside.

def some_function(f, k1, v1, k2, v2): """...""" with open("HaploBlock_Updated.txt", 'w') as f: # Write header (part 01): # This is just a front part of the header. # The rear part of the header is date dynamically depending upon the number of samples. f.write('\t'.join(['chr', 'pos', 'ms02g_PI', 'ms02g_PG_al', '\n'])) # Iterate over two Hap Blocks at once. This is done to obtain all possible Haplotype # configurations between two blocks. The (keys,values) for first block is represented as # k1,v2 and for the later block as k2,v2. for (k1, v1), (k2, v2) in zip(grouped_data.items(), islice(grouped_data.items(), 1, None)): # Skip if one of the keys has no values if k1 == '.' or k2 == '.': continue some_function(f, k1, v1, k2, v2)

Obviously this needs a better name than some_function(f, k1, v1, k2, v2), and better documentation than just an ellipsis. I will leave that as a task for you to figure out.

Next we have

# Iterate over the first Haplotype Block, i.e the k1 block. # The nucleotides in the left of the phased SNPs are called Block01-haplotype-A, # and similarly on the right as Block01-haplotype-B. hap_block1_a = [x.split('|')[0] for x in v1['ms02g_PG_al']] # the left haplotype of Block01 hap_block1_b = [x.split('|')[1] for x in v1['ms02g_PG_al']] # Iterate over the second Haplotype Block, # i.e the k2 block hap_block2_a = [x.split('|')[0] for x in v2['ms02g_PG_al']] hap_block2_b = [x.split('|')[1] for x in v2['ms02g_PG_al']] # Now, we have to start to solve the possible haplotype configuration. # Possible haplotype configurations are either # # 1. Block01-haplotype-A phased with Block02-haplotype-A, # creating -> hapB1A-hapB2A, hapB1B-hapB2B # # 2. Block01-haplotype-A phased with Block02-haplotype-B # creating -> hapB1A-hapB2B, hapB1B-hapB2A # First possible configuration hapb1a_hapb2a = [hap_block1_a, hap_block2_a] hapb1b_hapb2b = [hap_block1_b, hap_block2_b] # Secondd possible configuration hapb1a_hapb2b = [hap_block1_a, hap_block2_b] hapb1b_hapb2a = [hap_block1_b, hap_block2_a]

Note that the hap_block variables are only ever used inside this snippet, and all take the form

[x.split('|')[X] for x in vY['ms02g_PG_al']]

for X and Y. So let's make that a function

# Iterate over the Haplotype Block, i.e the k1 block. # The nucleotides in the left of the phased SNPs are called Block01-haplotype-A, # and similarly on the right as Block01-haplotype-B. def hap_block(side, val): return [x.split('|')[side] for x in val['ms02g_PG_al']] # Now, we have to start to solve the possible haplotype configuration. # Possible haplotype configurations are either: # 1. Block01-haplotype-A phased with Block02-haplotype-A, # creating -> hapB1A-hapB2A, hapB1B-hapB2B hapb1a_hapb2a = [hap_block(0, v1), hap_block(0, v2)] hapb1b_hapb2b = [hap_block(1, v1), hap_block(1, v2)] # 2. Block01-haplotype-A phased with Block02-haplotype-B # creating -> hapB1A-hapB2B, hapB1B-hapB2A hapb1a_hapb2b = [hap_block(0, v1), hap_block(1, v2)] hapb1b_hapb2a = [hap_block(1, v1), hap_block(0, v2)]

Even better, produce a table.

hap_b1_to_b2 = { ('a', 'a'): [hap_block(0, v1), hap_block(0, v2)], ('a', 'b'): [hap_block(0, v1), hap_block(1, v2)], ('b', 'a'): [hap_block(1, v1), hap_block(0, v2)], ('b', 'b'): [hap_block(1, v1), hap_block(1, v2)], }

But not that this is only ever used as

hap_b1_to_b2[idx][0][n] + '_' + hap_b1_to_b2[idx][1][m]

As a small nit, next you

print('\nStarting MarkovChains')

which I would find nicer as two separate print statements.

print() print('Starting MarkovChains')

You write

sum_of_product_of_transition_probabilities_hapb1a_hapb2a = \ sumof_pt_hapb1a_b2a = 0 sum_of_product_of_transition_probabilities_hapb1b_hapb2b = \ sumof_pt_hapb1b_b2b = 0 sum_of_product_of_transition_probabilities_hapb1a_hapb2b = \ sumof_pt_hapb1a_b2b = 0 sum_of_product_of_transition_probabilities_hapb1b_hapb2a = \ sumof_pt_hapb1b_b2a = 0

This is not good; don't use dead variables as comments. There are other places you do this; fix them all.

Secondly, this once again implies you should be using some kind of higher-level container.

transition_prob_sums = { ('a', 'a'): 0, ('a', 'b'): 0, ('b', 'a'): 0, ('b', 'b'): 0, }

You then start a loop

for n in range(len(v1['ms02g_PI'])): # n-ranges from block01

Idiomatically, this would look more like

for n, _ in enumerate(v1['ms02g_PI']):

Inside the first level you write

# set the probabilities of each nucleotides at zero. they are updated for each level # of "n" after reading the number of each nucleotides at that position. pa = 0; pt = 0; pg = 0; pc = 0 # nucleotide_prob = [0., 0., 0., 0.] # or store as numpy matrix # also storing these values as dict # probably can be improved nucleotide_prob_dict = {'A': 0, 'T': 0, 'G': 0, 'C': 0} print('prob as dict: ', nucleotide_prob_dict) # for storing the product-values of the transition probabilities. # these are updated for each level of "n" paired with each level of "m". potp_hapb1ab2a = 1 potp_hapb1bb2b = 1 potp_hapb1ab2b = 1 potp_hapb1bb2a = 1 # now, we read each level of "m" to compute the transition from each level of "n" # to each level of "m". for m in range(len(v2['ms02g_PI'])): # m-ranges from block02 # .. inner loop .. # step 02: sum of the product of the transition probabilities transition_prob_sums['a','a'] += potp_hapb1ab2a transition_prob_sums['b','b'] += potp_hapb1bb2b transition_prob_sums['a','b'] += potp_hapb1ab2b transition_prob_sums['b','a'] += potp_hapb1bb2a

You clearly have some idea with introducing nucleotide_prob_dict, but you fail to capitalize on it since you never do anything with the type. Let's try another approach, using only the dictionary

# Probabilities of each nucleotides start at zero, # and are updated each outer iteration after reading # the number of each nucleotide at that position. nucleotide_probs = {'A': 0, 'T': 0, 'G': 0, 'C': 0} print('prob as dict: ', nucleotide_probs) # The product-values of the transition probabilities, # updated each inner iteration. prod_transition_probs = { ('a', 'a'): 1, ('a', 'b'): 1, ('b', 'a'): 1, ('b', 'b'): 1, }

We'll come to the run-on effects of doing this later. We enter the inner loop and get another encounter,

pa_a = 0; pa_t = 0; pa_g = 0; pa_c = 0 pt_a = 0; pt_t = 0; pt_g = 0; pt_c = 0 pg_a = 0; pg_t = 0; pg_g = 0; pg_c = 0 pc_a = 0; pc_t = 0; pc_g = 0; pc_c = 0

which we can rewrite

transition_probs = { ('A','A'): 0, ('A','T'): 0, ('A','G'): 0, ('A','C'): 0, ('T','A'): 0, ('T','T'): 0, ('T','G'): 0, ('T','C'): 0, ('G','A'): 0, ('G','T'): 0, ('G','G'): 0, ('G','C'): 0, ('C','A'): 0, ('C','T'): 0, ('C','G'): 0, ('C','C'): 0, }

Again, we'll see the run-on effects later. We do, however, find issue colliding with the next value;

# also, creating an empty dictionary to store transition probabilities # - probably upgrade using numpy transition_prob_dict = {'A_A' : 0, 'A_T' : 0, 'A_G' : 0, 'A_C' : 0, 'T_A' : 0, 'T_T' : 0, 'T_G' : 0, 'T_C' : 0, 'G_A' : 0, 'G_T' : 0, 'G_G' : 0, 'G_C' : 0, 'C_A' : 0, 'C_T' : 0, 'C_G' : 0, 'C_C' : 0}

You haven't explained what makes these transition probabilities different, so the whole thing is really confusing. However, you only use this later on with sub_probs and co., so just move it down there:

transition_prob_dict = {'A_A' : 0, 'A_T' : 0, 'A_G' : 0, 'A_C' : 0, 'T_A' : 0, 'T_T' : 0, 'T_G' : 0, 'T_C' : 0, 'G_A' : 0, 'G_T' : 0, 'G_G' : 0, 'G_C' : 0, 'C_A' : 0, 'C_T' : 0, 'C_G' : 0, 'C_C' : 0} # storing transition probability as dict transition_prob_dict['A_A'] = sum_probs(transition_probs['A','A'], nucleotide_probs['A']) transition_prob_dict['A_T'] = sum_probs(transition_probs['A','T'], nucleotide_probs['A']) transition_prob_dict['A_G'] = sum_probs(transition_probs['A','G'], nucleotide_probs['A']) transition_prob_dict['A_C'] = sum_probs(transition_probs['A','C'], nucleotide_probs['A']) transition_prob_dict['T_A'] = sum_probs(transition_probs['T','A'], nucleotide_probs['T']) transition_prob_dict['T_T'] = sum_probs(transition_probs['T','T'], nucleotide_probs['T']) # ... etc ...

By this point you no longer want transition_probs, so you can just replace the name. Since everything is a data structure here, you can just implement a basic iteration.

for from_, nucleotide_prob in nucleotide_probs.items(): for to in "ATGC": transition_probs[from_, to] = sum_probs(transition_probs[from_, to], nucleotide_prob)

The following code is now

# find possible configuration between "n" and "m". hapb1a_hapb2a_transition = hap_b1_to_b2['a','a'][0][n], hap_b1_to_b2['a','a'][1][m] hapb1b_hapb2b_transition = hap_b1_to_b2['b','b'][0][n], hap_b1_to_b2['b','b'][1][m] hapb1a_hapb2b_transition = hap_b1_to_b2['a','b'][0][n], hap_b1_to_b2['a','b'][1][m] hapb1b_hapb2a_transition = hap_b1_to_b2['b','a'][0][n], hap_b1_to_b2['b','a'][1][m] # compute the products of transition probabilities on the for loop prod_transition_probs['a','a'] *= transition_probs[hapb1a_hapb2a_transition] prod_transition_probs['b','b'] *= transition_probs[hapb1b_hapb2b_transition] prod_transition_probs['a','b'] *= transition_probs[hapb1a_hapb2b_transition] prod_transition_probs['b','a'] *= transition_probs[hapb1b_hapb2a_transition]

which can also be transformed

for pair in prod_transition_probs: transition = hap_b1_to_b2[pair][0][n], hap_b1_to_b2[pair][1][m] prod_transition_probs[pair] *= transition_probs[transition]

Let us reverse a bit, to the for x, y in sample_list. We have

nucleotide_probs['A'] += v1[y][n].count('A') nucleotide_probs['T'] += v1[y][n].count('T') nucleotide_probs['G'] += v1[y][n].count('G') nucleotide_probs['C'] += v1[y][n].count('C')

which is just

for nucleotide in 'ATGC': nucleotide_probs[nucleotide] += v1[y][n].count(nucleotide)

You have another redundant assignment, with a redundant comment, to remove.

hap_config = [] # create empty list

Inside you have

ziped_nuclb1b2 = [(x + '_' + y) for (x,y) in zip(nucl_b1, nucl_b2)] hap_config = ziped_nuclb1b2 # move this counting function else where transition_probs['A','A'] += (hap_config.count('A_A')) # (v1[y][0].count('A'))/8 transition_probs['A','T'] += (hap_config.count('A_T')) transition_probs['A','G'] += (hap_config.count('A_G')) transition_probs['A','C'] += (hap_config.count('A_C')) transition_probs['T','A'] += (hap_config.count('T_A')) # (v1[y][0].count('A'))/8 transition_probs['T','T'] += (hap_config.count('T_T')) transition_probs['T','G'] += (hap_config.count('T_G')) transition_probs['T','C'] += (hap_config.count('T_C')) transition_probs['G','A'] += (hap_config.count('G_A')) # (v1[y][0].count('A'))/8 transition_probs['G','T'] += (hap_config.count('G_T')) transition_probs['G','G'] += (hap_config.count('G_G')) transition_probs['G','C'] += (hap_config.count('G_C')) transition_probs['C','A'] += (hap_config.count('C_A')) transition_probs['C','T'] += (hap_config.count('C_T')) transition_probs['C','G'] += (hap_config.count('C_G')) transition_probs['C','C'] += (hap_config.count('C_C'))

First, the hap_config variable seems rather strange to me. Secondly, more importantly, you write x + '_' + y when it would be nicer to deal in tuples to start with. Overall this whole thing can be compressed to

ziped_nuclb1b2 = list(zip(nucl_b1, nucl_b2)) for transition in transition_probs: transition_probs[transition] += ziped_nuclb1b2.count(transition)

and similar for the second run, though you should hoist the zip and use an else instead of an inverted if.

ziped_nuclb1b2 = list(zip(nucl_b1, nucl_b2)) if v1[x][n] == v2[x][m]: for transition in transition_probs: transition_probs[transition] += ziped_nuclb1b2.count(transition) else: for transition in transition_probs: transition_probs[transition] += ziped_nuclb1b2.count(transition) / 2

Scrolling down a bit, because I've lost track of any semblance of order, we can replace

# step 02: sum of the product of the transition probabilities transition_prob_sums['a','a'] += prod_transition_probs['a','a'] transition_prob_sums['b','b'] += prod_transition_probs['b','b'] transition_prob_sums['a','b'] += prod_transition_probs['a','b'] transition_prob_sums['b','a'] += prod_transition_probs['b','a']

with

for pair in transition_prob_sums: transition_prob_sums[pair] += prod_transition_probs[pair]

I'm going to leave this here for now, since I only have so much time to cover changes, though you can continue a lot of these cleanups in the same way. Other major untouched issues are around clearer high-level explanations, building the program out of separated functions, removing hardcoded paths, and moving to a matrix representation rather than a dictionary approach.

And I hope that I can some point in time. Also, I would like to bother you with other problem: stackoverflow.com/questions/48737403/… I am thinking if you can suggest me something on this. I don't mind learning but being formally a biologist, program related things happen slow for me. So, any explanations to help me learn faster is appreciative. I have looked into several parallelizing example on python but I cannot translate it to solve my problem. — everestial
– everestial, Commented Feb 12, 2018 at 4:41
Hi @veedrac : thank you for all the suggestions. As a beginner pythonist and also being informally self trained I am glad yo receive the formal check points and ideas from you. I am going to make suggested changes, making my program and explanation comprehensive. Btw, I am hoping someone can suggest me numpy/scipy implementation of my algorithm; if not I will wait couple of months to teach myself some numpy. Till then this should be enough — everestial
– everestial, Commented Feb 12, 2018 at 4:44

Stack Exchange Network

Solve the phase state between two haplotype blocks using markov transition probabilities

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Solve the phase state between two haplotype blocks using markov transition probabilities

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions