Snakemake workflow: BuscoClade

Description

Pipeline to construct species phylogenies from genome assemblies or variant call data (VCF).

Workflow

flowchart TD
    %% ----- INPUT -----
    subgraph INPUT["Input data"]
        A_fa["Genome assemblies (FASTA)"]
        A_vcf["Per-sample VCFs + reference"]
        A_vcf2["Multi-sample VCF"]
    end

    %% ----- BUSCO -----
    subgraph BUSCO["Ortholog extraction"]
        B_busco["BUSCO"]
    end

    %% ----- PREPROCESSING -----
    subgraph PREP["Sequence processing"]
        subgraph ALN["Multiple alignment"]
            C_aln["MAFFT / MUSCLE / PRANK"]
        end
        subgraph FLT["Trimming"]
            C_flt["ClipKIT / GBlocks / TrimAl"]
        end
    end

    %% ----- PHYLOGENY -----
    subgraph PHYLO["Phylogenetic tree inference"]
        subgraph CONCAT["Supermatrix approach"]
            E_phy["IQTree / MrBayes / PHYLIP / RAxML-NG / RapidNJ"]
        end
        subgraph TREE["Multispecies coalescent"]
            D_ast["Astral-IV"]
        end
    end

    %% ----- EDGES: MAIN -----
    A_fa --> B_busco
    A_vcf -->|"GATK FastaAlternateReferenceMaker"| B_busco
    B_busco --> C_aln
    C_aln --> C_flt
    C_flt -->|"Concat alignment"| E_phy
    C_flt -->|"IQTree per gene"| D_ast

    %% ----- EDGES: VCF2PHYLIP -----
    A_vcf2 -. "vcf2phylip.py" .-> E_phy

    %% ----- STYLE -----
    classDef input fill:#e8f4ff,stroke:#2b7cd3,stroke-width:1px
    classDef process fill:#eaf7ea,stroke:#2f9e44,stroke-width:1px
    classDef phylo fill:#fff4e6,stroke:#e67700,stroke-width:1px
    classDef optional fill:#e8f4ff,stroke:#2b7cd3,stroke-width:1px,stroke-dasharray:4 4
    class A_fa,A_vcf input
    class B_busco,C_aln,C_flt process
    class D_ast,E_phy phylo
    class A_vcf2 optional

Ortholog extraction: BUSCO
VCF-based reconstruction: GATK FastaAlternateReferenceMaker, vcf2phylip
Alignment: MAFFT, MUSCLE, PRANK
Trimming: ClipKIT, TrimAl, GBlocks
Phylogenetic tree construction: IQTree, MrBayes, ASTRAL-IV, RapidNJ, PHYLIP, RAxML-NG
Visualization: ETE Toolkit, Matplotlib

Usage

Step 1. Deploy workflow

Clone the repository or download the latest release:

git clone https://github.com/tomarovsky/BuscoClade.git

Step 2. Prepare input data

FASTA assemblies

Place genome assemblies into input/genomes/. The file prefix is used as the sample name in the output phylogeny. Supported extensions: .fasta, .fna, .fa, and their gzipped versions (.fasta.gz, .fna.gz, .fa.gz).
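To preview which sample names a set of assemblies would produce, the naming rule can be sketched in Python. This is only an illustration: the helper names `sample_name` and `list_samples` are hypothetical, and the sketch assumes the "file prefix" is the file name with one of the supported extensions removed, which may differ in detail from what the workflow does internally.

```python
from pathlib import Path

# Extensions listed in the text above; gzipped forms are checked first.
GENOME_EXTENSIONS = (".fasta.gz", ".fna.gz", ".fa.gz", ".fasta", ".fna", ".fa")


def sample_name(path):
    """Return the file name with a supported extension removed, or None."""
    name = Path(path).name
    for ext in GENOME_EXTENSIONS:
        if name.endswith(ext) and len(name) > len(ext):
            return name[: -len(ext)]
    return None


def list_samples(genome_dir):
    """Map sample name -> path for every supported file in genome_dir."""
    samples = {}
    for path in sorted(Path(genome_dir).iterdir()):
        name = sample_name(path)
        if path.is_file() and name is not None:
            # Two files such as X.fa and X.fasta.gz would collide on "X".
            if name in samples:
                raise ValueError(f"duplicate sample name: {name}")
            samples[name] = path
    return samples
```

Calling `list_samples("input/genomes/")` before a run is a quick way to spot files with unsupported extensions (they are skipped) or two files that would map to the same sample name.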

Per-sample VCFs + reference genome

If you have per-sample VCFs, the pipeline can reconstruct pseudo-genome assemblies using GATK FastaAlternateReferenceMaker, which are then fed into the standard BUSCO workflow alongside any FASTA assemblies.

Place per-sample VCF files and the corresponding reference genome together into a subdirectory under input/vcf_reconstruct/. Each subdirectory is processed independently, which allows reconstructing pseudo-genomes against different references in a single run:

input/
    genomes/
        Species1.fasta
        Species2.fasta.gz
    vcf_reconstruct/
        project_hg38/          # one reference per directory
            reference.fasta
            SampleA.vcf.gz
            SampleB.vcf.gz
        project_mm39/          # another reference
            reference.fasta
            SampleC.vcf.gz

The directory name is used only for organization; the VCF file prefix determines the sample name in the output phylogeny. No additional config changes are needed, because the pipeline detects subdirectories automatically.
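The layout described above can be checked before a run with a short Python sketch. It is not the pipeline's own detection logic: the function name `discover_vcf_samples` is hypothetical, and the sketch assumes each subdirectory holds exactly one uncompressed FASTA reference and that per-sample VCFs end in `.vcf.gz`, as in the example.

```python
from pathlib import Path


def discover_vcf_samples(vcf_reconstruct_dir):
    """Return {sample_name: (vcf_path, reference_path)} over all subdirectories."""
    samples = {}
    subdirs = sorted(p for p in Path(vcf_reconstruct_dir).iterdir() if p.is_dir())
    for subdir in subdirs:
        # Assumption: the reference is the only FASTA file in the subdirectory.
        references = [
            p for p in subdir.iterdir()
            if p.name.endswith((".fasta", ".fna", ".fa"))
        ]
        if len(references) != 1:
            raise ValueError(
                f"{subdir}: expected one reference, found {len(references)}"
            )
        for vcf in sorted(subdir.glob("*.vcf.gz")):
            # The VCF file prefix becomes the sample name.
            name = vcf.name[: -len(".vcf.gz")]
            if name in samples:
                raise ValueError(f"duplicate sample name: {name}")
            samples[name] = (vcf, references[0])
    return samples
```

Because sample names come from the VCF prefix rather than the directory name, the sketch rejects the same prefix appearing in two subdirectories, which would otherwise yield two samples with one name.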

Multi-sample VCF via vcf2phylip (optional)

As an alternative to the BUSCO-based pipeline, a concatenated PHYLIP alignment can be built directly from a multi-sample VCF using vcf2phylip.py, bypassing BUSCO and sequence alignment entirely. When vcf2phylip: True is set, only this route is executed.

Place exactly one multi-sample .vcf.gz file into input/vcf2phylip/ and enable the option in the config:

input/
    vcf2phylip/
        all_samples.vcf.gz     # exactly one multi-sample VCF

vcf2phylip: True
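The "exactly one file" requirement is easy to verify up front. The following Python sketch does so; `find_multisample_vcf` is a hypothetical helper, not part of the workflow, and it only counts files matching `*.vcf.gz`.

```python
from pathlib import Path


def find_multisample_vcf(vcf2phylip_dir):
    """Return the single .vcf.gz in the directory, or raise if there is not exactly one."""
    vcfs = sorted(Path(vcf2phylip_dir).glob("*.vcf.gz"))
    if len(vcfs) != 1:
        raise ValueError(
            f"{vcf2phylip_dir}: expected exactly one .vcf.gz, found {len(vcfs)}"
        )
    return vcfs[0]
```

For example, `find_multisample_vcf("input/vcf2phylip/")` returns the path to `all_samples.vcf.gz` for the layout shown above and raises `ValueError` if the directory is empty or holds several VCFs.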

Step 3. Configure workflow

Edit config/default.yaml or, preferably, copy it and pass the copy with --configfile. The config has four sections:

Pipeline configuration

Enable or disable tools and modes:

vcf2phylip: False        # set True to use vcf2phylip instead of BUSCO-based pipeline
quastcore: True          # assembly statistics
alignment: "mafft"       # 'mafft', 'muscle' or 'prank'
filtration: "clipkit"    # 'clipkit', 'trimal' or 'gblocks'
iqtree: True
astral: True
rapidnj: True
phylip: True
raxml: True
mrbayes: False           # recommended to run GPU-compiled version separately
draw_phylotrees: True
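A typo in one of these switches is cheaper to catch before submitting jobs. The Python sketch below checks a config that has already been loaded into a dictionary against the allowed values listed in the comments above. The function `check_pipeline_config` is hypothetical, and the workflow may apply its own, different validation.

```python
ALIGNERS = {"mafft", "muscle", "prank"}
TRIMMERS = {"clipkit", "trimal", "gblocks"}
SWITCHES = (
    "vcf2phylip", "quastcore", "iqtree", "astral", "rapidnj",
    "phylip", "raxml", "mrbayes", "draw_phylotrees",
)


def check_pipeline_config(config):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if config.get("alignment") not in ALIGNERS:
        problems.append(f"alignment must be one of {sorted(ALIGNERS)}")
    if config.get("filtration") not in TRIMMERS:
        problems.append(f"filtration must be one of {sorted(TRIMMERS)}")
    for key in SWITCHES:
        # Only switches present in the config are checked for type.
        if key in config and not isinstance(config[key], bool):
            problems.append(f"{key} must be True or False")
    return problems
```

With the default values shown above the function returns an empty list; setting `alignment: "clustal"`, for instance, yields a single message naming the accepted aligners.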

Tool parameters

Key parameters to configure before running:

BUSCO:

busco_dataset_path: Path to a pre-downloaded OrthoDB dataset (e.g. "$TOOLS/busco_datasets/mammalia_odb12/").
busco_options: Use "--offline" to run without internet access.
busco_mode: Typically "genome".
busco_blacklist: Path to a file with BUSCO IDs to exclude (optional).

Alignment (parameters passed directly to the chosen tool):

prank_params, mafft_params, muscle_params

Filtration:

clipkit_params, gblocks_params, trimal_params

Phylogenetic inference:

iqtree_params: e.g. "-keep-ident -m TESTNEW -bb 1000". Add -o 'OUTGROUP' to set an outgroup.
astral_params: e.g. "--support 2". Add --root 'OUTGROUP' to set an outgroup.
raxml_params: e.g. "--model GTR+G --bs-trees 100".
rapidnj_params: e.g. "-b 1000".
phylip_dnadist_params: Use "D\n" for Kimura 2-parameter model, or "" for F84 (default).
phylip_neighbor_params: Use "N" for UPGMA, or "" for NJ (default).
mrbayes_block, mrbayes_params: MrBayes configuration block file and extra parameters, respectively.

Visualization:

tree_visualization_params: Specify outgroup as "--outgroup OUTGROUP".
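Since the outgroup is written differently for IQTree, ASTRAL and the visualization step, it can help to derive all three parameter strings from one value. The Python sketch below appends the options quoted above to existing parameter strings; `with_outgroup` is a hypothetical helper and `Homo_sapiens` in the usage note is only a placeholder sample name.

```python
def with_outgroup(params, outgroup):
    """Return a copy of params with the outgroup option appended for each tool."""
    # Option spellings are taken from the parameter descriptions above.
    flags = {
        "iqtree_params": f"-o '{outgroup}'",
        "astral_params": f"--root '{outgroup}'",
        "tree_visualization_params": f"--outgroup {outgroup}",
    }
    updated = dict(params)
    for key, flag in flags.items():
        # strip() avoids a leading space when the original string is empty.
        updated[key] = f"{params.get(key, '')} {flag}".strip()
    return updated
```

For example, `with_outgroup({"iqtree_params": "-keep-ident -m TESTNEW -bb 1000"}, "Homo_sapiens")` keeps the existing IQTree options and adds `-o 'Homo_sapiens'`, while the two keys that were absent receive only their outgroup option.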

Directory structure

Input and output paths are defined here. The defaults are:

# Input
genome_dir: "input/genomes/"
vcf_reconstruct_dir: "input/vcf_reconstruct/"
vcf2phylip_dir: "input/vcf2phylip/"

# Output (all under results/)
output_dir: "results/"

It is recommended to leave the directory structure unchanged.

Resources

Each tool has its own Slurm settings: partition (*_queue), threads (*_threads), memory in MB (*_mem_mb), and runtime (*_time). Adjust these to match your cluster configuration. BUSCO and PRANK are the most time-consuming steps and may require generous time limits (defaults: 150 h and 100 h, respectively).

Step 4. Execute workflow

Install Snakemake:

mamba create -c conda-forge -c bioconda -c nodefaults -n snakemake snakemake snakemake-executor-plugin-cluster-generic
mamba activate snakemake

Dry run to preview all steps:

snakemake --profile profile/slurm/ --configfile config/default.yaml --dry-run

Remove --dry-run to start the actual run.

Advanced usage

Starting from completed BUSCO results

Move genome assemblies (or create empty placeholder files) into input/genomes/, then place BUSCO output directories under results/busco/. Expected structure for Ailurus_fulgens.fasta:

results/
    busco/
        Ailurus_fulgens/
            busco_sequences/
                fragmented_busco_sequences/
                multi_copy_busco_sequences/
                single_copy_busco_sequences/
            hmmer_output/
            logs/
            metaeuk_output/
            full_table_Ailurus_fulgens.tsv
            missing_busco_list_Ailurus_fulgens.tsv
            short_summary_Ailurus_fulgens.txt
            short_summary.json
            short_summary.specific.mammalia_odb10.Ailurus_fulgens.json
            short_summary.specific.mammalia_odb10.Ailurus_fulgens.txt
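Both preparation steps can be scripted. The Python sketch below creates empty placeholder assemblies and reports samples whose BUSCO output looks incomplete. The helper names are hypothetical. The sketch also assumes that single_copy_busco_sequences/ sits inside busco_sequences/, as in standard BUSCO output, and that its presence is a reasonable sign of a finished run. Which files the workflow actually requires is not confirmed here.

```python
from pathlib import Path


def create_placeholders(sample_names, genome_dir="input/genomes"):
    """Create an empty <sample>.fasta for each sample; existing files are left untouched."""
    Path(genome_dir).mkdir(parents=True, exist_ok=True)
    for name in sample_names:
        (Path(genome_dir) / f"{name}.fasta").touch()


def missing_busco_results(sample_names, busco_dir="results/busco"):
    """Return the samples that lack a single-copy BUSCO sequence directory."""
    missing = []
    for name in sample_names:
        expected = (
            Path(busco_dir) / name / "busco_sequences" / "single_copy_busco_sequences"
        )
        if not expected.is_dir():
            missing.append(name)
    return missing
```

Running `missing_busco_results(["Ailurus_fulgens"])` from the repository root returns an empty list when the directory above is in place. Any names it does return point to BUSCO outputs that still need to be copied or recomputed.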

Contact

Please email andrey.tomarovsky@gmail.com with questions or feedback.
