mchmm

A Python package for statistical modeling with Markov chains and Hidden Markov models. Built on NumPy and SciPy, mchmm provides efficient implementations of core algorithms including Viterbi decoding and Baum-Welch parameter estimation. The package also includes visualization capabilities for understanding model structure and behavior.

Key Features

  • Discrete Markov Chains: Build transition models from sequence data with automatic state inference
  • Hidden Markov Models: Implement HMMs with customizable observation and state spaces
  • Viterbi Algorithm: Find most likely state sequences for new observations
  • Baum-Welch Algorithm: Learn HMM parameters from unlabeled sequence data
  • Statistical Testing: Built-in chi-squared tests for model validation
  • Visualization: Generate directed graphs of Markov models using Graphviz
  • Simulation: Generate synthetic sequences from trained models

Donate

If you find this package useful, please consider donating any amount. This will help me spend more time supporting open-source software.

Buy Me A Coffee

Installation

Install from PyPI:

pip install mchmm

Or install from source:

git clone https://github.com/maximtrp/mchmm.git
cd mchmm
pip install . --user

Dependencies

  • NumPy - Numerical computing
  • SciPy - Scientific computing and statistics
  • Graphviz - Graph visualization (optional)

Quick Start

Discrete Markov Chains

Initializing a Markov chain from sequence data:

import mchmm as mc

a = mc.MarkovChain().from_data('AABCABCBAAAACBCBACBABCABCBACBACBABABCBACBBCBBCBCBCBACBABABCBCBAAACABABCBBCBCBCBCBCBAABCBBCBCBCCCBABCBCBBABCBABCABCCABABCBABC')
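
The unique states are inferred automatically and exposed via the states attribute (for this sequence they should be A, B, and C):

a.states
# array(['A', 'B', 'C'], dtype='<U1')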

Now, we can look at the observed transition frequency matrix:

a.observed_matrix
# array([[ 7., 18.,  7.],
#        [19.,  5., 29.],
#        [ 5., 30.,  3.]])

And the observed transition probability matrix:

a.observed_p_matrix
# array([[0.21875   , 0.5625    , 0.21875   ],
#        [0.35849057, 0.09433962, 0.54716981],
#        [0.13157895, 0.78947368, 0.07894737]])
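
The probability matrix is simply the frequency matrix normalized row by row. A minimal NumPy sketch verifying this relationship:

import numpy as np

# dividing each row of frequencies by its row total gives the transition probabilities
p = a.observed_matrix / a.observed_matrix.sum(axis=1, keepdims=True)
np.allclose(p, a.observed_p_matrix)
# True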

You can visualize your Markov chain. First, build a directed graph with the graph_make() method of a MarkovChain object. Then render() it.

graph = a.graph_make(
    format="png",
    graph_attr=[("rankdir", "LR")],
    node_attr=[("fontname", "Roboto bold"), ("fontsize", "20")],
    edge_attr=[("fontname", "Iosevka"), ("fontsize", "12")]
)
graph.render()

Here is the result:

[Markov chain graph]

Pandas can help us annotate columns and rows:

import pandas as pd

pd.DataFrame(a.observed_matrix, index=a.states, columns=a.states, dtype=int)
#     A   B   C
# A   7  18   7
# B  19   5  29
# C   5  30   3

Viewing the expected transition frequency matrix:

a.expected_matrix
# array([[ 8.06504065, 13.78861789, 10.14634146],
#        [13.35772358, 22.83739837, 16.80487805],
#        [ 9.57723577, 16.37398374, 12.04878049]])
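
The expected frequencies are the standard contingency-table values (row total × column total / grand total), i.e. the counts you would expect if consecutive states were independent. A sketch reproducing them:

import numpy as np

obs = a.observed_matrix
row_totals = obs.sum(axis=1, keepdims=True)  # transitions out of each state
col_totals = obs.sum(axis=0, keepdims=True)  # transitions into each state
np.allclose(row_totals * col_totals / obs.sum(), a.expected_matrix)
# True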

Calculating the Nth order transition probability matrix:

a.n_order_matrix(a.observed_p_matrix, order=2)
# array([[0.2782854 , 0.34881028, 0.37290432],
#        [0.1842357 , 0.64252707, 0.17323722],
#        [0.32218957, 0.21081868, 0.46699175]])
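
For a first-order chain, the Nth order matrix coincides with the one-step matrix raised to the Nth power, so the same result can be checked with np.linalg.matrix_power:

import numpy as np

# the 2-step transition probabilities are the square of the 1-step matrix
np.allclose(
    np.linalg.matrix_power(a.observed_p_matrix, 2),
    a.n_order_matrix(a.observed_p_matrix, order=2),
)
# True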

Carrying out a chi-squared test:

a.chisquare(a.observed_matrix, a.expected_matrix, axis=None)
# Power_divergenceResult(statistic=47.89038802624337, pvalue=1.0367838347591701e-07)
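
The result object is SciPy's Power_divergenceResult, so an equivalent test can be run directly with scipy.stats.chisquare (a sketch, assuming the method is a thin wrapper around it):

from scipy.stats import chisquare

# compare observed and expected transition counts across the flattened matrices
chisquare(a.observed_matrix, a.expected_matrix, axis=None)
# Power_divergenceResult(statistic=47.89038802624337, pvalue=1.0367838347591701e-07)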

Finally, let's simulate a Markov chain given our data.

import numpy as np

ids, states = a.simulate(10, start='A', seed=np.random.randint(0, 10, 10))
ids
# array([0, 2, 1, 0, 2, 1, 0, 2, 1, 0])
states
# array(['A', 'C', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A'], dtype='<U1')
"".join(states)
# 'ACBACBACBA'

Hidden Markov Models

Build HMMs from paired observation and state sequences. This example uses a DNA fragment with TATA box annotation:

import mchmm as mc

obs_seq = 'AGACTGCATATATAAGGGGCAGGCTG'
sts_seq = '00000000111111100000000000'
a = mc.HiddenMarkovModel().from_seq(obs_seq, sts_seq)

Unique states and observations are automatically inferred:

a.states
# ['0' '1']
a.observations
# ['A' 'C' 'G' 'T']

The transition probability matrix for all states can be accessed using the tp attribute:

a.tp
# [[0.94444444 0.05555556]
#  [0.14285714 0.85714286]]

There is also an ep attribute holding the emission probability matrix for all states and observations.

a.ep
# [[0.21052632 0.21052632 0.47368421 0.10526316]
#  [0.57142857 0.         0.         0.42857143]]
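
Both matrices are row-stochastic: each row of tp is a distribution over next states, and each row of ep is a distribution over observed symbols, so every row sums to one (up to floating-point rounding):

a.tp.sum(axis=1)
# array([1., 1.])
a.ep.sum(axis=1)
# array([1., 1.])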

Converting the emission matrix to a Pandas DataFrame:

import pandas as pd

pd.DataFrame(a.ep, index=a.states, columns=a.observations)
#           A         C         G         T
# 0  0.210526  0.210526  0.473684  0.105263
# 1  0.571429  0.000000  0.000000  0.428571

Directed graph of the hidden Markov model:

[Hidden Markov model graph]

The graph can be built using the graph_make() method of a HiddenMarkovModel object:

graph = a.graph_make(
    format="png",
    graph_attr=[("rankdir", "LR"), ("ranksep", "1"), ("rank", "same")]
)
graph.render()

Viterbi Algorithm

Decode the most likely state sequence for new observations:

new_obs = "GGCATTGGGCTATAAGAGGAGCTTG"
vs, vsi = a.viterbi(new_obs)
# states sequence
print("VI", "".join(vs))
# observations
print("NO", new_obs)
# VI 0000000001111100000000000
# NO GGCATTGGGCTATAAGAGGAGCTTG
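
As a quick sanity check, you can also decode the training sequence itself and count how many positions agree with the original annotation (a sketch; perfect agreement is not guaranteed):

# decode the training observations and compare with the known state labels
vs_train, _ = a.viterbi(obs_seq)
matches = sum(s == t for s, t in zip(vs_train, sts_seq))
print(f"{matches}/{len(sts_seq)} positions agree with the annotation")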

Baum-Welch Algorithm

Learn HMM parameters from unlabeled sequence data using expectation-maximization:

obs_seq = 'AGACTGCATATATAAGGGGCAGGCTG'
a = mc.HiddenMarkovModel().from_baum_welch(obs_seq, states=['0', '1'])
# training log: KL divergence values for all iterations
a.log
# {
#     'tp': [0.008646969455670256, 0.0012397829805491124, 0.0003950986109761759],
#     'ep': [0.09078874423746826, 0.0022734816599056084, 0.0010118204023946836],
#     'pi': [0.009030829793043593, 0.016658391248503462, 0.0038894983546756065]
# }

The inferred transition (tp) and emission (ep) probability matrices and the initial state distribution (pi) can be accessed as attributes:

a.ep, a.tp, a.pi
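
Each list in a.log holds one KL divergence per iteration between consecutive parameter estimates, so small final values indicate the EM updates have stabilized; a quick convergence check:

# smaller final divergences mean the estimates changed little in the last iteration
for name, divergences in a.log.items():
    print(name, "final KL divergence:", divergences[-1])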

This model can be decoded using the Viterbi algorithm:

new_obs = "GGCATTGGGCTATAAGAGGAGCTTG"
vs, vsi = a.viterbi(new_obs)
print("VI", "".join(vs))
print("NO", new_obs)
# VI 0011100001111100000001100
# NO GGCATTGGGCTATAAGAGGAGCTTG