AlleleFlux

AlleleFlux is a bioinformatics toolkit for analyzing allele frequency changes in metagenomic time-series data. It identifies genomic targets of natural selection in microbial communities by calculating:

  • Parallelism scores — detect parallel allele frequency changes across replicates within groups
  • Divergence scores — quantify allele frequency divergence between experimental groups
  • dN/dS ratios — measure selection pressure on genes via non-synonymous to synonymous substitution rates

These scores enable direct comparisons of evolutionary dynamics across taxa, genomes, and genes, helping identify loci under strong selection.
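As a rough illustration of the quantity these scores build on, the sketch below turns per-base read counts at one position into allele frequencies and takes the difference between two timepoints. The function names and the count dictionaries are hypothetical, and this is not AlleleFlux's scoring code; the actual parallelism and divergence scores involve statistical testing across replicates and groups.

```python
BASES = ("A", "C", "G", "T")


def allele_frequencies(counts):
    """Convert per-base read counts at one position into frequencies."""
    total = sum(counts.get(base, 0) for base in BASES)
    if total == 0:
        return None  # no coverage at this position
    return {base: counts.get(base, 0) / total for base in BASES}


def frequency_change(pre_counts, post_counts):
    """Per-allele frequency difference (post minus pre), or None without coverage."""
    pre = allele_frequencies(pre_counts)
    post = allele_frequencies(post_counts)
    if pre is None or post is None:
        return None
    return {base: post[base] - pre[base] for base in BASES}


# Hypothetical counts for one position in one replicate.
change = frequency_change({"A": 18, "G": 2}, {"A": 5, "G": 15})
print(change)
```

Here the A allele drops by about 0.65 and the G allele rises by the same amount. Seeing a shift of the same sign in several replicates of one group is the kind of pattern a parallelism score is meant to capture.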

Installation

From Bioconda (Recommended)

    # Create a dedicated environment and install AlleleFlux into it (conda or mamba)
    conda create -n alleleflux -c conda-forge -c bioconda alleleflux

    # Activate the environment
    conda activate alleleflux

From Source

    # Clone the repository
    git clone https://github.com/MoellerLabPU/AlleleFlux.git
    cd AlleleFlux

    # Create environment with dependencies
    conda env create -f environment.yml
    conda activate alleleflux

    # Or install directly with pip
    pip install -e .

Input Files

| File            | Description                                                       |
|-----------------|-------------------------------------------------------------------|
| Reference FASTA | Combined MAG contigs (header format: <MAG_ID>.fa_<contig_ID>)     |
| Prodigal genes  | Nucleotide ORF predictions (.fna) matching reference contig IDs   |
| MAG mapping     | TSV with columns: mag_id, contig_id                               |
| Metadata TSV    | Sample info with columns: sample_id, bam_path, group, time        |
| GTDB taxonomy   | gtdbtk.bac120.summary.tsv from GTDB-Tk                            |

See Input Preparation Guide for detailed format specifications.
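Two quick checks can catch common input problems before a run. The sketch below splits a FASTA header of the form <MAG_ID>.fa_<contig_ID> and reports which required metadata columns are missing. The helper names are hypothetical, and the README does not say whether the MAG mapping's contig_id column holds the full header or only the suffix, so confirm that against the Input Preparation Guide.

```python
import csv
import io

# Column names taken from the Input Files table above.
REQUIRED_METADATA_COLUMNS = ("sample_id", "bam_path", "group", "time")


def split_header(header):
    """Split '<MAG_ID>.fa_<contig_ID>' into (mag_id, contig_suffix)."""
    parts = header.lstrip(">").split()
    name = parts[0] if parts else ""
    mag_id, separator, contig_suffix = name.partition(".fa_")
    if not separator or not mag_id or not contig_suffix:
        raise ValueError(f"header does not match <MAG_ID>.fa_<contig_ID>: {header!r}")
    return mag_id, contig_suffix


def missing_metadata_columns(tsv_text):
    """Return the required columns that are absent from the TSV header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    present = set(reader.fieldnames or [])
    return [column for column in REQUIRED_METADATA_COLUMNS if column not in present]


print(split_header(">MAG001.fa_contig_12"))
print(missing_metadata_columns("sample_id\tbam_path\tgroup\n"))
```

The first call yields the MAG ID and the contig suffix; the second reports that the time column is missing.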

Quick Start

1. Initialize Configuration

    # Interactive configuration wizard
    alleleflux init

    # Or copy the template manually
    cp $(python -c "import alleleflux; print(alleleflux.__path__[0])")/smk_workflow/config.template.yml config.yml
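The manual copy can also be done from Python, which avoids nested shell quoting. This is a minimal sketch: the function name is hypothetical, and the relative path smk_workflow/config.template.yml is taken from the command above. It returns None when the package is not installed instead of raising.

```python
import importlib.util
from pathlib import Path


def find_config_template(package="alleleflux"):
    """Return the expected path of config.template.yml inside an installed package."""
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.submodule_search_locations:
        return None  # package missing, or not a regular package directory
    package_root = Path(list(spec.submodule_search_locations)[0])
    return package_root / "smk_workflow" / "config.template.yml"


template = find_config_template()
if template is None:
    print("alleleflux is not installed in this environment")
else:
    print(f"template expected at: {template}")
```

If the path exists, `shutil.copy(template, "config.yml")` reproduces the `cp` command.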

2. Edit Configuration

Edit config.yml with your paths and parameters. Here is the complete configuration with all options:

    run_name: "alleleflux_analysis"

    # Input Files
    input:
      fasta_path: ""          # Reference FASTA file (required)
      prodigal_path: ""       # Prodigal nucleic acid output (.fna)
      metadata_path: ""       # Sample metadata file
      gtdb_path: ""           # GTDB taxonomy file
      mag_mapping_path: ""    # MAG-to-contig mapping file

    # Output Directory
    output:
      root_dir: "./alleleflux_output"
      log_level: "INFO"       # DEBUG, INFO, WARNING, ERROR

    # Analysis Configuration
    analysis:
      data_type: "longitudinal"       # "single" or "longitudinal"
      allele_analysis_only: False     # Skip scoring/outlier detection
      use_lmm: True                   # Linear Mixed Models
      use_significance_tests: True    # Two-sample and single-sample tests
      use_cmh: True                   # CMH test
      timepoints_combinations:
        - timepoint: ["pre", "post"]
          focus: "post"
      groups_combinations:
        - ["treatment", "control"]

    # Quality Control
    quality_control:
      min_sample_num: 4
      breadth_threshold: 0.1
      coverage_threshold: 1
      disable_zero_diff_filtering: False

    # Profiling
    profiling:
      ignore_orphans: False
      min_base_quality: 30
      min_mapping_quality: 2
      ignore_overlaps: True

    # Statistics
    statistics:
      filter_type: "t-test"           # "t-test", "wilcoxon", or "both"
      preprocess_between_groups: True
      preprocess_within_groups: True
      max_zero_count: 4
      p_value_threshold: 0.05
      fdr_group_by_mag_id: False
      min_positions_after_preprocess: 1

    # dN/dS Analysis
    dnds:
      p_value_column: "q_value"
      dn_ds_test_type: "two_sample_paired_tTest"

    # Compute Resources
    resources:
      threads_per_job: 16
      mem_per_job: "8G"
      time: "24:00:00"
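A few of these options accept only a fixed set of values, so a small pre-flight check can catch typos before the pipeline starts. The sketch below works on the configuration as a plain dictionary; the `validate_config` helper is hypothetical and not part of AlleleFlux. The allowed values are the ones listed in this README, and only fasta_path is treated as required because it is the only option marked that way.

```python
# Allowed values copied from the comments and parameter table in this README.
CHOICES = {
    ("analysis", "data_type"): {"single", "longitudinal"},
    ("statistics", "filter_type"): {"t-test", "wilcoxon", "both"},
    ("dnds", "p_value_column"): {"min_p_value", "q_value"},
}


def validate_config(config):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if not config.get("input", {}).get("fasta_path"):
        problems.append("input.fasta_path is required")
    for (section, key), allowed in CHOICES.items():
        value = config.get(section, {}).get(key)
        if value not in allowed:
            problems.append(f"{section}.{key}={value!r} is not one of {sorted(allowed)}")
    breadth = config.get("quality_control", {}).get("breadth_threshold")
    is_number = isinstance(breadth, (int, float)) and not isinstance(breadth, bool)
    if not is_number or not 0 <= breadth <= 1:
        problems.append("quality_control.breadth_threshold must be between 0 and 1")
    return problems


# The template defaults, with fasta_path still left empty.
example = {
    "input": {"fasta_path": ""},
    "analysis": {"data_type": "longitudinal"},
    "statistics": {"filter_type": "t-test"},
    "dnds": {"p_value_column": "q_value"},
    "quality_control": {"breadth_threshold": 0.1},
}
print(validate_config(example))
```

With the template defaults the only reported problem is the empty fasta_path. To check a real file, the dictionary could come from PyYAML's `yaml.safe_load`, assuming PyYAML is available in the environment.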

Configuration Parameters

| Section         | Parameter               | Description                                                     |
|-----------------|-------------------------|-----------------------------------------------------------------|
| input           | fasta_path              | Reference FASTA with combined MAG contigs                       |
|                 | prodigal_path           | Prodigal nucleotide predictions (.fna)                          |
|                 | metadata_path           | Sample metadata TSV                                             |
|                 | gtdb_path               | GTDB-Tk taxonomy file                                           |
|                 | mag_mapping_path        | MAG-to-contig mapping TSV                                       |
| analysis        | data_type               | "longitudinal" (multiple timepoints) or "single"                |
|                 | allele_analysis_only    | Skip significance tests, scoring, and outlier detection if True |
|                 | use_lmm                 | Enable Linear Mixed Models                                      |
|                 | use_significance_tests  | Enable two-sample/single-sample tests                           |
|                 | use_cmh                 | Enable Cochran-Mantel-Haenszel tests                            |
|                 | timepoints_combinations | Timepoint pairs with focus timepoint                            |
|                 | groups_combinations     | Groups to compare                                               |
| quality_control | min_sample_num          | Minimum samples required per MAG                                |
|                 | breadth_threshold       | Minimum coverage breadth (0-1)                                  |
|                 | coverage_threshold      | Minimum average coverage depth                                  |
| profiling       | min_base_quality        | Minimum base quality score                                      |
|                 | min_mapping_quality     | Minimum mapping quality score                                   |
| statistics      | filter_type             | Preprocessing filter type                                       |
|                 | p_value_threshold       | Significance threshold                                          |
|                 | fdr_group_by_mag_id     | Apply FDR correction per MAG                                    |
| dnds            | p_value_column          | "min_p_value" or "q_value"                                      |
|                 | dn_ds_test_type         | Test type for filtering dN/dS results                           |
| resources       | threads_per_job         | Threads allocated per job                                       |
|                 | mem_per_job             | Memory per job (e.g., "8G", "100G")                             |
|                 | time                    | Wall time limit (HH:MM:SS)                                      |

See Configuration Reference for complete documentation.
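To make the effect of `fdr_group_by_mag_id` concrete, the sketch below adjusts p-values either across all records or separately within each MAG. It assumes the Benjamini-Hochberg procedure, which this README does not name, and the function names and example records are hypothetical.

```python
from collections import defaultdict


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        q_values[index] = running_min
    return q_values


def adjust(records, group_by_mag_id=False):
    """records is a list of (mag_id, p_value); returns q-values in the same order."""
    if not group_by_mag_id:
        return benjamini_hochberg([p for _, p in records])
    groups = defaultdict(list)
    for index, (mag_id, _) in enumerate(records):
        groups[mag_id].append(index)
    q_values = [None] * len(records)
    for indices in groups.values():
        adjusted = benjamini_hochberg([records[i][1] for i in indices])
        for i, value in zip(indices, adjusted):
            q_values[i] = value
    return q_values


records = [("MAG_A", 0.01), ("MAG_A", 0.04), ("MAG_B", 0.03), ("MAG_B", 0.005)]
print(adjust(records))                        # one correction over all four tests
print(adjust(records, group_by_mag_id=True))  # one correction per MAG
```

Grouping shrinks the number of tests each correction sees, so the resulting q-values generally differ from the pooled ones. That matters when `p_value_column` is set to "q_value".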

3. Run the Pipeline

    # Run locally
    alleleflux run --config config.yml --threads 16

    # Dry run to preview jobs
    alleleflux run --config config.yml --dry-run
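When the pipeline is launched from a script, building the command as a list avoids quoting problems. The sketch below uses only the flags shown above; the `build_run_command` helper is hypothetical, and combining `--threads` with `--dry-run` in one call is an assumption.

```python
import shlex


def build_run_command(config="config.yml", threads=None, dry_run=False):
    """Assemble an `alleleflux run` command line as an argument list."""
    command = ["alleleflux", "run", "--config", str(config)]
    if threads is not None:
        command += ["--threads", str(threads)]
    if dry_run:
        command.append("--dry-run")
    return command


# Print the commands rather than executing them.
print(shlex.join(build_run_command(threads=16)))
print(shlex.join(build_run_command(dry_run=True)))
```

Passing the list to `subprocess.run(command, check=True)` would execute it and raise if the pipeline exits with an error.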

Running on SLURM

For HPC clusters, copy the SLURM profile from the source repository:

    # Copy SLURM profile (if installed from source)
    cp -r $(python -c "import alleleflux; print(alleleflux.__path__[0])")/smk_workflow/slurm_profile ./

    # Run with SLURM
    alleleflux run --config config.yml --profile ./slurm_profile

The SLURM profile automatically submits jobs via sbatch with resources from your config.
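The resource values in the config are strings such as "8G" and "24:00:00". The sketch below shows one way to sanity-check them by converting to megabytes and minutes. How the bundled profile actually passes these values to sbatch is not described here, so the helper names and the 1G = 1024 MB convention are assumptions.

```python
import re

# Multipliers to megabytes; only the suffixes handled here are accepted.
_MB_PER_UNIT = {"M": 1, "G": 1024, "T": 1024 * 1024}


def memory_to_mb(text):
    """Convert a string like '8G' or '100G' to whole megabytes."""
    match = re.fullmatch(r"(\d+)([MGT])", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized memory value: {text!r}")
    return int(match.group(1)) * _MB_PER_UNIT[match.group(2)]


def walltime_to_minutes(text):
    """Convert HH:MM:SS to minutes, rounding any leftover seconds up."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 60 + minutes + (1 if seconds > 0 else 0)


print(memory_to_mb("8G"), walltime_to_minutes("24:00:00"))
```

A value that fails to parse here, such as "8 GB" or "24h", is worth correcting in config.yml before submitting jobs.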

How It Works

AlleleFlux is powered by a Snakemake workflow that orchestrates the complete analysis:

    Input Files           Profile & QC            Statistical Analysis
    ━━━━━━━━━━━           ━━━━━━━━━━━━            ━━━━━━━━━━━━━━━━━━━━
    • Reference FASTA     • Extract alleles       • Two-sample tests
    • Prodigal genes      • Quality control       • LMM / CMH tests
    • Metadata TSV        • Eligibility checks    • dN/dS calculation
    • MAG mapping
                                  ↓
                         ┌─────────────────┐
                         │  Scoring & Viz  │
                         ├─────────────────┤
                         │ • Parallelism   │
                         │ • Divergence    │
                         │ • Outliers      │
                         │ • Trajectories  │
                         └─────────────────┘

Pipeline Steps:

  1. Profiling — Extract allele frequencies from BAM files for each MAG
  2. Quality Control — Filter samples by coverage breadth; determine MAG eligibility
  3. Statistical Testing — Apply appropriate tests based on experimental design
  4. Scoring — Calculate parallelism/divergence scores and identify outlier genes
  5. dN/dS Analysis — Calculate evolutionary rates for genes under selection
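The quality control step can be pictured with a small sketch that computes coverage breadth and mean depth for one MAG in one sample, then counts how many samples pass. The default thresholds mirror the `quality_control` values in the example config. The function names, the inclusive comparisons, and the flat per-MAG sample count are assumptions; the real eligibility rules may also account for groups and timepoints.

```python
def breadth_and_depth(depths):
    """depths holds the read depth at every position of one MAG in one sample."""
    length = len(depths)
    if length == 0:
        return 0.0, 0.0
    covered = sum(1 for depth in depths if depth > 0)
    return covered / length, sum(depths) / length


def sample_passes(depths, breadth_threshold=0.1, coverage_threshold=1):
    """True when both breadth and mean depth reach their thresholds."""
    breadth, mean_depth = breadth_and_depth(depths)
    return breadth >= breadth_threshold and mean_depth >= coverage_threshold


def mag_is_eligible(samples, min_sample_num=4, **thresholds):
    """samples maps sample_id to a depth list; returns (eligible, passing_ids)."""
    passing = [name for name, depths in samples.items()
               if sample_passes(depths, **thresholds)]
    return len(passing) >= min_sample_num, passing


# Hypothetical depths: four well-covered samples and one with no reads.
samples = {f"s{i}": [0, 0, 3, 5] for i in range(4)}
samples["empty"] = [0, 0, 0, 0]
print(mag_is_eligible(samples))
```

With these inputs the four covered samples pass and the MAG meets the minimum of four samples; the empty sample is dropped.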

The workflow:

  • Automatically parallelizes across samples and MAGs
  • Handles checkpointing and restarts gracefully
  • Supports local execution and HPC clusters (SLURM)
  • Tracks provenance and ensures reproducibility

Output

Results are organized in the output directory:

    alleleflux_output/
    ├── profiles/             # Per-sample allele frequency profiles
    ├── metadata/             # Per-MAG metadata tables
    ├── eligibility/          # MAG eligibility tables
    ├── allele_analysis/      # Allele frequency analysis results
    ├── significance_tests/   # Statistical test results (LMM, CMH, t-tests)
    ├── scores/               # Parallelism and divergence scores
    ├── outliers/             # Genes with high scores (selection targets)
    └── dnds/                 # dN/dS analysis results

See Output Reference for file format details.
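After a run, a short script can report which of these directories are present. This is a minimal sketch: the `missing_outputs` helper is hypothetical and assumes the subdirectories sit directly under the configured `root_dir`, as in the tree above.

```python
from pathlib import Path

# Subdirectory names taken from the output tree in this README.
EXPECTED_SUBDIRS = (
    "profiles", "metadata", "eligibility", "allele_analysis",
    "significance_tests", "scores", "outliers", "dnds",
)


def missing_outputs(root_dir="./alleleflux_output"):
    """Return the expected subdirectories that do not exist under root_dir."""
    root = Path(root_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


print(missing_outputs("./alleleflux_output"))
```

A missing directory is not necessarily an error. For example, with `allele_analysis_only: True` the scoring and outlier steps are skipped, so their outputs would not be expected.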

CLI Tools

AlleleFlux provides 30+ standalone command-line tools:

    # List all available tools
    alleleflux tools

    # Main commands
    alleleflux run --help     # Run the full pipeline
    alleleflux init --help    # Interactive configuration
    alleleflux info           # Show installation info

    # Individual analysis tools
    alleleflux-profile --help               # Profile MAGs from BAM files
    alleleflux-qc --help                    # Quality control
    alleleflux-scores --help                # Calculate parallelism/divergence scores
    alleleflux-dnds-from-timepoints --help  # Calculate dN/dS ratios

See CLI Reference for the complete list.

Documentation

Full documentation: alleleflux.readthedocs.io

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

AlleleFlux is licensed under the GNU General Public License v3.0.
