GHGA AQuA Nextflow Pipeline

Introduction

GHGA AQuA (Automatic Quality Assessment) Pipeline is a bioinformatics pipeline that performs basic quality control over input datasets without altering the raw data. It accepts three main input types:

Raw FastQ files
Aligned BAM/CRAM files
Variant-called VCF/BCF files

The pipeline automatically selects the appropriate quality control tools based on the analysis method you provide (e.g., WGS, RNA-Seq, Nanopore) and compiles the results into a single MultiQC report.

Tool & Analysis Matrix

The following table details which tools are executed based on the analysis method and input data type provided in the samplesheet.

| Analysis Method | Read QC (FastQ) | Alignment QC (BAM/CRAM) | Variant QC (VCF) |
| --- | --- | --- | --- |
| WGS / WES / TES | FastQC, FastP, SeqFU | Mosdepth, Samtools Stats, Picard, VerifyBamID, NGS-Bits* | BCFTools Stats |
| ATAC / ChIP-Seq | FastQC, FastP, SeqFU | Mosdepth, Samtools Stats, Picard | BCFTools Stats |
| RNA-Seq / smRNA | FastQC, FastP, SeqFU | RSeQC | BCFTools Stats |
| Nanopore | FastQC, NanoPlot | - | BCFTools Stats |
| PacBio | FastPLong | - | BCFTools Stats |
| MethylSeq | FastQC, FastP, SeqFU | - | BCFTools Stats |

* NGS-Bits SampleGender is run for WGS if predict_sex is enabled.
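
The matrix above can be read as a lookup from analysis method and input type to a list of tools. The following Python sketch mirrors the table for illustration only. The dictionary keys (`wgs`, `chip`, `smrna`, and so on), the data type names, and the function `tools_for` are assumptions made for this example, not the pipeline's actual Nextflow implementation.

```python
# Illustrative lookup mirroring the Tool & Analysis Matrix.
# All keys and names here are hypothetical; the pipeline itself is Nextflow.
SHORT_READ_QC = ["FastQC", "FastP", "SeqFU"]
DNA_ALIGNMENT_QC = ["Mosdepth", "Samtools Stats", "Picard"]


def _row(read_qc, alignment_qc):
    """Build one matrix row; every method gets BCFTools Stats for VCF input."""
    return {"fastq": read_qc, "alignment": alignment_qc, "vcf": ["BCFTools Stats"]}


TOOL_MATRIX = {}
for method in ("wgs", "wes", "tes"):
    TOOL_MATRIX[method] = _row(SHORT_READ_QC, DNA_ALIGNMENT_QC + ["VerifyBamID"])
for method in ("atac", "chip"):
    TOOL_MATRIX[method] = _row(SHORT_READ_QC, DNA_ALIGNMENT_QC)
for method in ("rna", "smrna"):
    TOOL_MATRIX[method] = _row(SHORT_READ_QC, ["RSeQC"])
TOOL_MATRIX["nanopore"] = _row(["FastQC", "NanoPlot"], [])
TOOL_MATRIX["pacbio"] = _row(["FastPLong"], [])
TOOL_MATRIX["methylseq"] = _row(SHORT_READ_QC, [])


def tools_for(method, data_type, predict_sex=False):
    """Return the tools the matrix lists for a method and input type."""
    method = method.lower()
    tools = list(TOOL_MATRIX[method][data_type])
    # Footnote: NGS-Bits SampleGender runs for WGS only if predict_sex is enabled.
    if method == "wgs" and data_type == "alignment" and predict_sex:
        tools.append("NGS-Bits SampleGender")
    return tools


if __name__ == "__main__":
    print(tools_for("wgs", "alignment", predict_sex=True))
    print(tools_for("nanopore", "fastq"))
```

An empty list corresponds to a `-` cell in the table, meaning no tool is listed for that combination.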

Usage

The pipeline can be started in two ways: by providing a manual samplesheet or by providing GHGA-compliant metadata.

Option A: Manual Samplesheet (CSV)

Create a samplesheet.csv containing your input data. The pipeline auto-detects the starting step for each row based on which file columns are populated:

step 1: FastQ files (Read QC)
step 2: BAM/CRAM files (Alignment QC)
step 3: VCF files (Variant QC)

A samplesheet containing a mix of raw data, mapped BAMs, and variant files would look like this:

    sample,lane,individual_id,sex,experiment_method,fastq_1,fastq_2,bam,bai,vcf
    SAMPLE_FASTQ,L001,ind_1,MALE,wgs,s1_R1.fastq.gz,s1_R2.fastq.gz,,,
    SAMPLE_BAM,L001,ind_2,FEMALE,wgs,,,s2.bam,s2.bam.bai,
    SAMPLE_VCF,L001,ind_3,NA,wgs,,,,,s3.vcf.gz
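
To make the auto-detection idea concrete, here is a minimal Python sketch that parses the example samplesheet and guesses the starting step of each row from its populated file columns. The function names `detect_step` and `read_steps` are hypothetical, and the precedence used when more than one file column is filled in (FastQ first, then BAM/CRAM, then VCF) is an assumption; the pipeline's own logic may differ.

```python
import csv
import io

# The example samplesheet from above, embedded so the sketch is self-contained.
SAMPLESHEET = """\
sample,lane,individual_id,sex,experiment_method,fastq_1,fastq_2,bam,bai,vcf
SAMPLE_FASTQ,L001,ind_1,MALE,wgs,s1_R1.fastq.gz,s1_R2.fastq.gz,,,
SAMPLE_BAM,L001,ind_2,FEMALE,wgs,,,s2.bam,s2.bam.bai,
SAMPLE_VCF,L001,ind_3,NA,wgs,,,,,s3.vcf.gz
"""


def detect_step(row):
    """Guess the starting step from the populated file columns.

    Assumption: FastQ takes precedence over BAM/CRAM, which takes
    precedence over VCF, when several columns are filled in.
    """
    if row.get("fastq_1"):
        return 1  # Read QC
    if row.get("bam") or row.get("cram"):
        return 2  # Alignment QC
    if row.get("vcf"):
        return 3  # Variant QC
    raise ValueError(f"no input file given for sample {row.get('sample')!r}")


def read_steps(text):
    """Map each sample name in the CSV text to its detected step."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["sample"]: detect_step(row) for row in reader}


if __name__ == "__main__":
    for sample, step in read_steps(SAMPLESHEET).items():
        print(sample, "-> step", step)
```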

| Column | Description |
| --- | --- |
| `sample` | Required. Custom sample name. This identifier is used to group multiple sequencing runs (lanes) from the same sample. Spaces are automatically converted to underscores (`_`). |
| `lane` | Required. Identifier for the sequencing lane or library (e.g., L001, L002). Must not contain spaces. |
| `individual_id` | Identifier for the individual (patient/subject). |
| `sex` | Biological sex of the individual (e.g., MALE, FEMALE, NA). |
| `status` | Disease status as an integer: `0` (Normal/Control) or `1` (Tumor/Case). |
| `phenotype` | Phenotypic terms or descriptions associated with the individual. |
| `sample_type` | The type of sample (e.g., GENOMIC_DNA, TOTAL_RNA). |
| `disease_status` | Text description of the disease status (e.g., Healthy, Tumor). |
| `case_control_status` | Status in the study design (e.g., CASE, CONTROL). |
| `tissue` | The source tissue of the specimen (e.g., blood, tissue). |
| `experiment_method` | The sequencing method used. Supported values: `wgs`, `wes`, `rna`, `atac`, `nanopore`, `pacbio`. |
| `analysis_method` | The type of analysis performed (e.g., `varcall`). |
| `fastq_1` | Path to the Read 1 FastQ file. Must end in `.fastq.gz` or `.fq.gz`. |
| `fastq_2` | Path to the Read 2 FastQ file for paired-end data. Optional for single-end data. |
| `single_end` | Boolean (`true`/`false`) indicating whether the sequencing is single-end. |
| `bam` | Path to the aligned BAM file. |
| `bai` | Path to the corresponding BAM index file. |
| `cram` | Path to the aligned CRAM file. |
| `crai` | Path to the corresponding CRAM index file. |
| `vcf` | Path to the Variant Call Format file. Must end in `.vcf` or `.vcf.gz`. |
| `data_files` | Semicolon-separated list of any other relevant data files not covered by specific columns. |
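
Several of the column rules above are easy to check before launching a run. The sketch below is a hypothetical pre-flight check written for this README, not part of the pipeline: `normalize_sample` and `check_row` are invented names, and only the constraints stated in the table are covered (required `sample` and `lane`, no spaces in `lane`, file extensions for `fastq_1` and `vcf`, and `status` being `0` or `1`).

```python
def normalize_sample(name):
    """Replace spaces with underscores, as described for the `sample` column."""
    return name.replace(" ", "_")


def check_row(row):
    """Return a list of problems found in one samplesheet row (a dict)."""
    problems = []

    if not row.get("sample"):
        problems.append("sample is required")

    lane = row.get("lane") or ""
    if not lane:
        problems.append("lane is required")
    elif " " in lane:
        problems.append("lane must not contain spaces")

    fastq_1 = row.get("fastq_1") or ""
    if fastq_1 and not fastq_1.endswith((".fastq.gz", ".fq.gz")):
        problems.append("fastq_1 must end in .fastq.gz or .fq.gz")

    vcf = row.get("vcf") or ""
    if vcf and not vcf.endswith((".vcf", ".vcf.gz")):
        problems.append("vcf must end in .vcf or .vcf.gz")

    status = row.get("status") or ""
    if status and status not in ("0", "1"):
        problems.append("status must be 0 or 1")

    return problems


if __name__ == "__main__":
    row = {"sample": "my sample", "lane": "L 001", "fastq_1": "reads.fastq"}
    print(normalize_sample(row["sample"]))
    for problem in check_row(row):
        print("problem:", problem)
```

Returning a list of messages rather than raising on the first error lets all problems in a row be reported at once.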

An example samplesheet has been provided with the pipeline.

Option B: GHGA Metadata (JSON)

If you already have a metadata.json following the GHGA metadata model, you can provide it directly. The pipeline will automatically convert the JSON into the required internal format, eliminating the need to create a manual samplesheet.

Run the Pipeline

Run the pipeline with a samplesheet as input, choosing one of the `docker`, `singularity`, or `conda` profiles:

    nextflow run main.nf \
        -profile <docker/singularity/conda> \
        --input samplesheet.csv \
        --outdir ./results

Alternatively, run it with GHGA metadata as input:

    nextflow run main.nf \
        -profile <docker/singularity/conda> \
        --metadata metadata.json \
        --outdir ./results
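
When the pipeline is launched from a script, the two invocations above differ only in the input flag. The following Python sketch assembles the argument list using only the flags shown in this section. The helper `build_command` is hypothetical, and choosing `--metadata` for files ending in `.json` is an assumption made for this example.

```python
import shlex


def build_command(input_path, profile="docker", outdir="./results"):
    """Assemble the nextflow command line as an argument list."""
    if profile not in ("docker", "singularity", "conda"):
        raise ValueError(f"unsupported profile: {profile}")
    # Assumption: a .json input is GHGA metadata, anything else a samplesheet.
    flag = "--metadata" if input_path.endswith(".json") else "--input"
    return [
        "nextflow", "run", "main.nf",
        "-profile", profile,
        flag, input_path,
        "--outdir", outdir,
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(..., check=True) to execute it.
    print(shlex.join(build_command("samplesheet.csv")))
    print(shlex.join(build_command("metadata.json", profile="singularity")))
```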

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
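
As one way to follow this advice, parameters can be written to a JSON file and passed with `-params-file`, which accepts JSON or YAML. The sketch below assumes the parameter names `input` and `outdir` match the `--input` and `--outdir` flags shown above; the helper `write_params` and the file name `params.json` are hypothetical.

```python
import json
from pathlib import Path


def write_params(path, **params):
    """Write pipeline parameters to a JSON file for use with -params-file."""
    target = Path(path)
    target.write_text(json.dumps(params, indent=2) + "\n")
    return target


if __name__ == "__main__":
    params_file = write_params(
        "params.json",
        input="samplesheet.csv",
        outdir="./results",
    )
    # The file can then be used as:
    #   nextflow run main.nf -profile docker -params-file params.json
    print(params_file.read_text())
```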

Supported Tools

Read QC

FastQC: Comprehensive quality control checks on raw sequence data.
FastP: All-in-one FASTQ preprocessor (used here for QC metrics).
SeqFU: Sequence statistics.
NanoPlot: Plotting tool for long read sequencing data and alignments.
FastPLong: Quality control for long read data (PacBio).

Alignment QC

Mosdepth: Fast BAM/CRAM depth calculation.
Samtools Stats: General statistics for alignment files.
Picard CollectMultipleMetrics: Collects multiple classes of metrics from alignment files.
RSeQC: Quality control for RNA-seq experiments.
NGS-Bits SampleGender: Sex determination based on coverage.
VerifyBamID: A robust tool for DNA contamination estimation from sequence reads using an ancestry-agnostic method.

Variant QC

BCFTools Stats: Statistics for VCF/BCF files.

Reporting

MultiQC: Aggregates results from all tools into a single HTML report.

Credits

The GHGA AQuA Nextflow pipeline was originally written by Kubra Narci (@kubranarci).

Current development team:

- Manuel Kösters
- Virag Sharma
- Ruchi Tanavade

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
