Pipeline to construct species phylogenies from genome assemblies or variant call data (VCF).

    flowchart TD
        %% ----- INPUT -----
        subgraph INPUT["Input data"]
            A_fa["Genome assemblies (FASTA)"]
            A_vcf["Per-sample VCFs + reference"]
            A_vcf2["Multi-sample VCF"]
        end

        %% ----- BUSCO -----
        subgraph BUSCO["Ortholog extraction"]
            B_busco["BUSCO"]
        end

        %% ----- PREPROCESSING -----
        subgraph PREP["Sequence processing"]
            subgraph ALN["Multiple alignment"]
                C_aln["MAFFT / MUSCLE / PRANK"]
            end
            subgraph FLT["Trimming"]
                C_flt["ClipKIT / GBlocks / TrimAl"]
            end
        end

        %% ----- PHYLOGENY -----
        subgraph PHYLO["Phylogenetic tree inference"]
            subgraph CONCAT["Supermatrix approach"]
                E_phy["IQTree / MrBayes / PHYLIP / RAxML-NG / RapidNJ"]
            end
            subgraph TREE["Multispecies coalescent"]
                D_ast["Astral-IV"]
            end
        end

        %% ----- EDGES: MAIN -----
        A_fa --> B_busco
        A_vcf -->|"GATK FastaAlternateReferenceMaker"| B_busco
        B_busco --> C_aln
        C_aln --> C_flt
        C_flt -->|"Concat alignment"| E_phy
        C_flt -->|"IQTree per gene"| D_ast

        %% ----- EDGES: VCF2PHYLIP -----
        A_vcf2 -. "vcf2phylip.py" .-> E_phy

        %% ----- STYLE -----
        classDef input fill:#e8f4ff,stroke:#2b7cd3,stroke-width:1px
        classDef process fill:#eaf7ea,stroke:#2f9e44,stroke-width:1px
        classDef phylo fill:#fff4e6,stroke:#e67700,stroke-width:1px
        classDef optional fill:#e8f4ff,stroke:#2b7cd3,stroke-width:1px,stroke-dasharray:4 4
        class A_fa,A_vcf input
        class B_busco,C_aln,C_flt process
        class D_ast,E_phy phylo
        class A_vcf2 optional

- Ortholog extraction: BUSCO
- VCF-based reconstruction: GATK FastaAlternateReferenceMaker, vcf2phylip
- Alignment: MAFFT, MUSCLE, PRANK
- Trimming: ClipKIT, TrimAl, GBlocks
- Phylogenetic tree construction: IQTree, MrBayes, ASTRAL-IV, RapidNJ, PHYLIP, RAxML-NG
- Visualization: ETE Toolkit, Matplotlib
Clone the repository or download the latest release:

    git clone https://github.com/tomarovsky/BuscoClade.git

Place genome assemblies into input/genomes/. The file prefix is used as the sample name in the output phylogeny. Supported extensions: .fasta, .fna, .fa, and their gzipped versions (.fasta.gz, .fna.gz, .fa.gz).
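
To check which sample names a set of assemblies would produce, the naming rule above can be sketched as a small shell helper. This is only an illustration: `sample_name` is a hypothetical function, and it assumes the prefix is simply the file name with one supported extension removed, which may differ from how the pipeline derives it internally.

```shell
# Derive a sample name from an assembly file name by stripping an optional
# .gz suffix and then one of the supported extensions.
# (Hypothetical helper; not part of the pipeline.)
sample_name() {
    base=$(basename "$1")
    base=${base%.gz}
    for ext in .fasta .fna .fa; do
        case "$base" in
            *"$ext") base=${base%"$ext"}; break ;;
        esac
    done
    printf '%s\n' "$base"
}

sample_name input/genomes/Species1.fasta      # prints Species1
sample_name input/genomes/Species2.fasta.gz   # prints Species2
```

Running it over `input/genomes/*` before a run is a quick way to spot two files that would collapse to the same sample name.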
If you have per-sample VCFs, the pipeline can reconstruct pseudo-genome assemblies using GATK FastaAlternateReferenceMaker, which are then fed into the standard BUSCO workflow alongside any FASTA assemblies.
Place per-sample VCF files and the corresponding reference genome together into a subdirectory under input/vcf_reconstruct/. Each subdirectory is processed independently, which allows reconstructing pseudo-genomes against different references in a single run:

    input/
        genomes/
            Species1.fasta
            Species2.fasta.gz
        vcf_reconstruct/
            project_hg38/            # one reference per directory
                reference.fasta
                SampleA.vcf.gz
                SampleB.vcf.gz
            project_mm39/            # another reference
                reference.fasta
                SampleC.vcf.gz

The directory name is used only for organization; the VCF file prefix determines the sample name in the output phylogeny. No additional config changes are needed: the pipeline detects subdirectories automatically.
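
Before launching a run, it can help to list which samples each subdirectory would contribute. The sketch below is a hypothetical helper (`list_vcf_samples` is not part of the pipeline) and assumes per-sample VCFs end in `.vcf.gz`, as in the layout above.

```shell
# Print "<project><TAB><sample>" for every per-sample VCF found one level
# below the given directory (default: input/vcf_reconstruct).
# (Hypothetical helper; not part of the pipeline.)
list_vcf_samples() {
    root=${1:-input/vcf_reconstruct}
    for vcf in "$root"/*/*.vcf.gz; do
        [ -e "$vcf" ] || continue              # the glob matched nothing
        project=$(basename "$(dirname "$vcf")")
        sample=$(basename "$vcf" .vcf.gz)      # VCF prefix = sample name
        printf '%s\t%s\n' "$project" "$sample"
    done
}

list_vcf_samples input/vcf_reconstruct
```

A sample name that appears here and also in input/genomes/ would be worth renaming before the run.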
As an alternative to the BUSCO-based pipeline, a concatenated PHYLIP alignment can be built directly from a multi-sample VCF using vcf2phylip.py, bypassing BUSCO and sequence alignment entirely. When vcf2phylip: True is set, only this route is executed.
Place exactly one multi-sample .vcf.gz file into input/vcf2phylip/ and enable the option in the config:

    input/
        vcf2phylip/
            all_samples.vcf.gz       # exactly one multi-sample VCF

    vcf2phylip: True

Modify config/default.yaml (recommended: copy it and pass with --configfile). The config has four sections:
Enable or disable tools and modes:

    vcf2phylip: False        # set True to use vcf2phylip instead of BUSCO-based pipeline
    quastcore: True          # assembly statistics
    alignment: "mafft"       # 'mafft', 'muscle' or 'prank'
    filtration: "clipkit"    # 'clipkit', 'trimal' or 'gblocks'
    iqtree: True
    astral: True
    rapidnj: True
    phylip: True
    raxml: True
    mrbayes: False           # recommended to run GPU-compiled version separately
    draw_phylotrees: True

Key parameters to configure before running:
BUSCO:
- busco_dataset_path: Path to a pre-downloaded OrthoDB dataset (e.g. "$TOOLS/busco_datasets/mammalia_odb12/").
- busco_options: Use "--offline" to run without internet access.
- busco_mode: Typically "genome".
- busco_blacklist: Path to a file with BUSCO IDs to exclude (optional).
Alignment (parameters passed directly to the chosen tool):
- prank_params, mafft_params, muscle_params
Filtration:
- clipkit_params, gblocks_params, trimal_params
Phylogenetic inference:
- iqtree_params: e.g. "-keep-ident -m TESTNEW -bb 1000". Add -o 'OUTGROUP' to set an outgroup.
- astral_params: e.g. "--support 2". Add --root 'OUTGROUP' to set an outgroup.
- raxml_params: e.g. "--model GTR+G --bs-trees 100".
- rapidnj_params: e.g. "-b 1000".
- phylip_dnadist_params: Use "D\n" for the Kimura 2-parameter model, or "" for F84 (default).
- phylip_neighbor_params: Use "N" for UPGMA, or "" for NJ (default).
- mrbayes_params, mrbayes_block: Extra parameters and the MrBayes configuration block file, respectively.
Visualization:
- tree_visualization_params: Specify the outgroup as "--outgroup OUTGROUP".
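
Because the outgroup is set separately for IQ-TREE, ASTRAL, and the visualization step, it is easy to let the three drift apart. One way to keep them in sync is to generate a small override file from a single variable, as sketched below. The outgroup name `Species1` and the file name `my_run.yaml` are hypothetical; the parameter values are the examples listed above. The sketch relies on Snakemake accepting several files after --configfile, with later files overriding earlier ones.

```shell
# Write an override file that uses one outgroup for all three steps.
outgroup="Species1"                  # hypothetical sample name
override=${override:-my_run.yaml}    # hypothetical file name

cat > "$override" <<EOF
iqtree_params: "-keep-ident -m TESTNEW -bb 1000 -o '$outgroup'"
astral_params: "--support 2 --root '$outgroup'"
tree_visualization_params: "--outgroup $outgroup"
EOF

# Pass it after the defaults so that its keys take precedence, e.g.:
# snakemake --profile profile/slurm/ \
#     --configfile config/default.yaml "$override" --dry-run
```

Keys that are not listed in the override file keep the values from config/default.yaml.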
Input and output paths are defined here. The defaults are:

    # Input
    genome_dir: "input/genomes/"
    vcf_reconstruct_dir: "input/vcf_reconstruct/"
    vcf2phylip_dir: "input/vcf2phylip/"

    # Output (all under results/)
    output_dir: "results/"

It is recommended to leave the directory structure unchanged.
Per-tool Slurm settings: partition (*_queue), threads (*_threads), memory in MB (*_mem_mb), and runtime (*_time). Adjust these to match your cluster configuration. Note that BUSCO and PRANK are the most time-consuming steps and may require generous time limits (default: 150h and 100h respectively).
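
To review the resources requested for one tool before submitting, the settings can be pulled out of the config with grep. The helper below is a hypothetical sketch: it assumes the keys are named with the tool as a prefix (for example a `busco_` prefix followed by `queue`, `threads`, `mem_mb`, or `time`), which follows from the suffix patterns above but should be checked against your copy of the config.

```shell
# Print the Slurm-related settings of one tool from a config file.
# (Hypothetical helper; assumes keys of the form <tool>_<suffix>.)
show_resources() {
    tool=$1
    cfg=${2:-config/default.yaml}
    grep -E "^[[:space:]]*${tool}_(queue|threads|mem_mb|time)[[:space:]]*:" "$cfg"
}

# Example: show the BUSCO settings if the default config is present.
if [ -f config/default.yaml ]; then
    show_resources busco || true
fi
```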
Install Snakemake:

    mamba create -c conda-forge -c bioconda -c nodefaults -n snakemake snakemake snakemake-executor-plugin-cluster-generic
    mamba activate snakemake

Dry run to preview all steps:

    snakemake --profile profile/slurm/ --configfile config/default.yaml --dry-run

Remove --dry-run to start the actual run.
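
The two steps can be chained so that the real run starts only when the dry run succeeds. The wrapper below is a minimal sketch: `run_pipeline` is a hypothetical function name, and it uses only the options shown above.

```shell
# Preview the workflow, then start the real run only if the preview succeeds.
# (Hypothetical wrapper; not part of the pipeline.)
run_pipeline() {
    configfile=${1:-config/default.yaml}
    snakemake --profile profile/slurm/ --configfile "$configfile" --dry-run || return 1
    snakemake --profile profile/slurm/ --configfile "$configfile"
}

# Usage (my_run.yaml stands for your own copy of config/default.yaml):
# run_pipeline config/my_run.yaml
```

Running the preview first catches missing inputs or config typos before any job is submitted to the cluster.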
To reuse existing BUSCO results, move the genome assemblies (or create empty placeholder files with the same names) into input/genomes/, then place the BUSCO output directories under results/busco/. Expected structure for Ailurus_fulgens.fasta:

    results/
        busco/
            Ailurus_fulgens/
                busco_sequences/
                    fragmented_busco_sequences/
                    multi_copy_busco_sequences/
                    single_copy_busco_sequences/
                hmmer_output/
                logs/
                metaeuk_output/
                full_table_Ailurus_fulgens.tsv
                missing_busco_list_Ailurus_fulgens.tsv
                short_summary_Ailurus_fulgens.txt
                short_summary.json
                short_summary.specific.mammalia_odb10.Ailurus_fulgens.json
                short_summary.specific.mammalia_odb10.Ailurus_fulgens.txt

Please email andrey.tomarovsky@gmail.com with questions or feedback.
