This software calculates a confidence interval for the mutation rate from a set of observed containment indices under a simple nucleotide mutation process. In this model, a sequence B evolves from a sequence A by independently mutating every nucleotide with probability p. We then observe the scaled containment indices as a result of the mutation process. The software then generates a confidence interval for the mutation rate p.
We recommend installation within an isolated conda environment. You can create a conda environment with all dependencies installed like so:
git clonethis repo andcdin:
git clone https://github.com/KoslickiLab/mutation-rate-ci-calculator cd mutation-rate-ci-calculator - Use
condaormambato create themrccenvironment with all dependencies installed:
conda env create --file environment.yml If you have
mambainstalled, feel free to use it here to speed things up. If you haven't installedcondaormamba, install it first via Miniforge or Miniconda.
- Activate the virtual environment and install this software with pip:
conda activate mrcc pip install -e '.' You should now be able to run the help for p_from_scaled_containment:
python -m mrcc.p_from_scaled_containment --help To compute a p confidence interval from an observed number of scaled containment indices:
python -m mrcc.p_from_scaled_containment -L 100K -k 21 -c 0.95 --sccon 0.10605 L k conf Cks CLow CHigh pLow pHigh 100000 21 0.95 0.10605 0.10046 0.11191 0.09623 0.10655In the example above, the L, k, conf, and Cks values are the input number of k-mers, k-mer size, desired confidence level, and observed scaled MinHash value respectively. The CLow and CHigh give the left and right 95% confidence interval around the true containment value. The pLow and pHigh give a confidence interval that is at least as wide as the 95% confidence interval around the true mutation rate p. Note that average nucleotide identity (ANI) is equal to 1-p.
In reality, you may not know L. In such cases, we recommend that you estimate it from what you know. For example, if what you know is that the number of distinct (i.e. counting duplicates only once) k-mers in A is nA and in B is nB, then you can set L = (nA + nB) / 2. You can also try setting L = min(nA, nB) or L = max(nA, nB).
You may also want to get a confidence interval on p from the number of mutated k-mers N, but you might only known the number of shared k-mers, i.e. the number of k-mers in both A and B. If this number is n, then you can set N = L - n.
Note that the programs consider L (uppercase) as the number of kmers in the sequence, and l (lowercase) as the number of nucleotides, with l = L + k-1.
- python3
- scipy
- numpy
For computing confidence intervals, only scipy is required. An optional module, mpmath, will be used if present (as described below).
Numpy is only used by the simulation programs.
Two addition packages are used if present: mpmath and mmh3.
mpmath is a multi-precision package, used here to avoid numerical problems that can occur for very low mutation probabilities (e.g. 1e-8). If the module is not present standard python floating-point is used instead.
mmh3 is a wrapper for MurmurHash3, used here for hashing kmers for bottom sketches. If the module is not present the hashing options in simulate_nucleotide_errors are not available.
python -m mrcc.p-from-scaled-containment.py
Compute confidence interval for the mutation rate p, given the observed number of mutated k-mers optional arguments: -h, --help show this help message and exit --sccon SCCON [SCCON ...] observed MinHash Containment (input one or more values, separated by a space) --length LENGTH number of nucleotides in the sequence -L NUM_UNIQUE_KMERS, --num-unique-kmers NUM_UNIQUE_KMERS number of unique k-mers in the sequence --scaled SCALED scaling factor of the sketch -k KSIZE, --ksize KSIZE kmer size -c CONFIDENCE, --confidence CONFIDENCE size of confidence interval, (value between 0 and 1) -s SEED, --seed SEED random seed for simulations --debug debug --debug_options [{nocache,nojmonotonicity,nsanity} [{nocache,nojmonotonicity,nsanity} ...]] Specify one or more debugging options, separated by a space