The GHGA AQuA (Automatic Quality Assessment) pipeline is a bioinformatics pipeline that performs basic quality control on input datasets without altering the raw data. It accepts three main input types:
- Raw FastQ files
- Aligned BAM/CRAM files
- Variant-called VCF/BCF files
The pipeline automatically selects the appropriate quality control tools based on your provided analysis method (e.g., WGS, RNA-Seq, Nanopore) and compiles the results into a single MultiQC report.
The following table details which tools are executed based on the analysis method and input data type provided in the samplesheet.
| Analysis Method | Read QC (FastQ) | Alignment QC (BAM/CRAM) | Variant QC (VCF) |
|---|---|---|---|
| WGS / WES / TES | FastQC, FastP, SeqFU | Mosdepth, Samtools Stats, Picard, VerifyBamID, NGS-Bits* | BCFTools Stats |
| ATAC / ChIP-Seq | FastQC, FastP, SeqFU | Mosdepth, Samtools Stats, Picard | BCFTools Stats |
| RNA-Seq / smRNA | FastQC, FastP, SeqFU | RSeQC | BCFTools Stats |
| Nanopore | FastQC, NanoPlot | - | BCFTools Stats |
| PacBio | FastPLong | - | BCFTools Stats |
| MethylSeq | FastQC, FastP, SeqFU | - | BCFTools Stats |
\* NGS-Bits SampleGender is run for WGS if `predict_sex` is enabled.
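The selection described by the table can be pictured as a lookup keyed by analysis method and input type. The Python sketch below is illustrative only: `QC_TOOLS`, `tools_for`, and the lowercase method keys are hypothetical names that mirror a few rows of the table, not the pipeline's actual Nextflow implementation.

```python
# Illustrative lookup that mirrors part of the tool table.
# All names here are hypothetical; the real pipeline is written in Nextflow.
SHORT_READ_QC = ["FastQC", "FastP", "SeqFU"]
DNA_ALIGNMENT_QC = ["Mosdepth", "Samtools Stats", "Picard"]
VARIANT_QC = ["BCFTools Stats"]

QC_TOOLS = {
    "wgs": {"fastq": SHORT_READ_QC,
            "alignment": DNA_ALIGNMENT_QC + ["VerifyBamID"],
            "vcf": VARIANT_QC},
    "atac": {"fastq": SHORT_READ_QC,
             "alignment": DNA_ALIGNMENT_QC,
             "vcf": VARIANT_QC},
    "rna": {"fastq": SHORT_READ_QC,
            "alignment": ["RSeQC"],
            "vcf": VARIANT_QC},
    "nanopore": {"fastq": ["FastQC", "NanoPlot"],
                 "alignment": [],
                 "vcf": VARIANT_QC},
    "pacbio": {"fastq": ["FastPLong"],
               "alignment": [],
               "vcf": VARIANT_QC},
}
# WES shares the WGS row of the table.
QC_TOOLS["wes"] = QC_TOOLS["wgs"]


def tools_for(method, input_type, predict_sex=False):
    """Return the QC tools for one analysis method and input type."""
    key = method.lower()
    tools = list(QC_TOOLS[key][input_type])
    # Per the footnote, NGS-Bits SampleGender is added only for WGS
    # and only when predict_sex is enabled.
    if key == "wgs" and input_type == "alignment" and predict_sex:
        tools.append("NGS-Bits SampleGender")
    return tools
```

For example, `tools_for("rna", "alignment")` returns only RSeQC, while `tools_for("wgs", "alignment", predict_sex=True)` appends NGS-Bits SampleGender to the WGS alignment tools.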
The pipeline can be started in two ways: by providing a manual samplesheet or by providing GHGA-compliant metadata.
Create a `samplesheet.csv` containing your input data. The pipeline auto-detects the starting step for each row based on which file columns are populated:
- step 1: FastQ files (Read QC)
- step 2: BAM/CRAM files (Alignment QC)
- step 3: VCF files (Variant QC)
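One way to picture the auto-detection is a function that inspects which file columns of a row are filled. This is a minimal sketch, not the pipeline's code: the function name `infer_step` is hypothetical, and the precedence (FastQ before BAM/CRAM before VCF when several columns are populated) is an assumption.

```python
def infer_step(row):
    """Guess the starting step (1, 2, or 3) for one samplesheet row.

    `row` is a dict mapping column names to string values, as produced
    by csv.DictReader. The precedence order is an assumption.
    """
    def filled(column):
        return bool((row.get(column) or "").strip())

    if filled("fastq_1"):
        return 1  # Read QC
    if filled("bam") or filled("cram"):
        return 2  # Alignment QC
    if filled("vcf"):
        return 3  # Variant QC
    raise ValueError("row has no FastQ, BAM/CRAM, or VCF path")
```

A row with only `vcf` populated would be classified as step 3, and a row with no file path at all raises an error rather than being skipped silently.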
A samplesheet containing a mix of raw data, mapped bams, and variant files would look like this:
```csv
sample,lane,individual_id,sex,experiment_method,fastq_1,fastq_2,bam,bai,vcf
SAMPLE_FASTQ,L001,ind_1,MALE,wgs,s1_R1.fastq.gz,s1_R2.fastq.gz,,,
SAMPLE_BAM,L001,ind_2,FEMALE,wgs,,,s2.bam,s2.bam.bai,
SAMPLE_VCF,L001,ind_3,NA,wgs,,,,,s3.vcf.gz
```

| Column | Description |
|---|---|
| `sample` | Required. Custom sample name. This identifier is used to group multiple sequencing runs (lanes) from the same sample. Spaces are automatically converted to underscores (`_`). |
| `lane` | Required. Identifier for the sequencing lane or library (e.g., `L001`, `L002`). Must not contain spaces. |
| `individual_id` | Identifier for the individual (patient/subject). |
| `sex` | Biological sex of the individual (e.g., `MALE`, `FEMALE`, `NA`). |
| `status` | Disease status as an integer: `0` (Normal/Control) or `1` (Tumor/Case). |
| `phenotype` | Phenotypic terms or descriptions associated with the individual. |
| `sample_type` | The type of sample (e.g., `GENOMIC_DNA`, `TOTAL_RNA`). |
| `disease_status` | Text description of the disease status (e.g., Healthy, Tumor). |
| `case_control_status` | Status in the study design (e.g., `CASE`, `CONTROL`). |
| `tissue` | The source tissue of the specimen (e.g., blood, tissue). |
| `experiment_method` | The sequencing method used. Supported values: `wgs`, `wes`, `rna`, `atac`, `nanopore`, `pacbio`. |
| `analysis_method` | The type of analysis performed (e.g., `varcall`). |
| `fastq_1` | Path to the Read 1 FastQ file. Must end in `.fastq.gz` or `.fq.gz`. |
| `fastq_2` | Path to the Read 2 FastQ file for paired-end data. Optional for single-end. |
| `single_end` | Boolean (`true`/`false`) indicating if the sequencing is single-end. |
| `bam` | Path to the aligned BAM file. |
| `bai` | Path to the corresponding BAM index file. |
| `cram` | Path to the aligned CRAM file. |
| `crai` | Path to the corresponding CRAM index file. |
| `vcf` | Path to the Variant Call Format file. Must end in `.vcf` or `.vcf.gz`. |
| `data_files` | Semicolon-separated list of any other relevant data files not covered by specific columns. |
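Several of the column rules above can be checked before launching a run. The following Python sketch uses only the standard library and covers a subset of the rules (required `sample` and `lane`, no spaces in `lane`, file extensions for `fastq_1` and `vcf`). The helper names `check_row` and `normalise_sample` and the sample data are hypothetical; the pipeline's own validation may differ.

```python
import csv
import io


def normalise_sample(name):
    """Mimic the documented behaviour: spaces become underscores."""
    return name.replace(" ", "_")


def check_row(row):
    """Return a list of problems found in one samplesheet row."""
    problems = []
    sample = (row.get("sample") or "").strip()
    lane = (row.get("lane") or "").strip()
    fastq_1 = (row.get("fastq_1") or "").strip()
    vcf = (row.get("vcf") or "").strip()

    if not sample:
        problems.append("sample is required")
    if not lane:
        problems.append("lane is required")
    elif " " in lane:
        problems.append("lane must not contain spaces")
    if fastq_1 and not fastq_1.endswith((".fastq.gz", ".fq.gz")):
        problems.append("fastq_1 must end in .fastq.gz or .fq.gz")
    if vcf and not vcf.endswith((".vcf", ".vcf.gz")):
        problems.append("vcf must end in .vcf or .vcf.gz")
    return problems


# Made-up rows: the first is valid, the second breaks two rules.
SHEET = """sample,lane,fastq_1,vcf
my sample,L001,s1_R1.fastq.gz,
SAMPLE_VCF,L 2,,s3.bcf
"""

for row in csv.DictReader(io.StringIO(SHEET)):
    print(normalise_sample(row["sample"]), check_row(row))
```

Returning a list of problems instead of raising on the first one lets all issues in a samplesheet be reported in a single pass.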
An example samplesheet has been provided with the pipeline.
If you already have a metadata.json following the GHGA metadata model, you can provide it directly. The pipeline will automatically convert the JSON into the required internal format, eliminating the need to create a manual samplesheet.
Run the pipeline using the command below with `samplesheet.csv` as input (choose one of `docker`, `singularity`, or `conda` for `-profile`):

```bash
nextflow run main.nf \
   -profile docker/singularity/conda \
   --input samplesheet.csv \
   --outdir ./results
```

or using the command below with `metadata.json` as input:

```bash
nextflow run main.nf \
   -profile docker/singularity/conda \
   --metadata metadata.json \
   --outdir ./results
```

**Warning:** Please provide pipeline parameters via the CLI or the Nextflow `-params-file` option. Custom config files, including those provided by the `-c` Nextflow option, can be used to provide any configuration except for parameters; see docs.
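As an illustration of the `-params-file` route, the Python sketch below writes a JSON parameter file. The file name `params.json` is a hypothetical choice, and the keys simply mirror the `--input` and `--outdir` flags from the commands above.

```python
import json

# Keys mirror the CLI flags --input and --outdir; values are examples.
params = {
    "input": "samplesheet.csv",
    "outdir": "./results",
}

# params.json is a hypothetical file name; any path works.
with open("params.json", "w") as handle:
    json.dump(params, handle, indent=2)
```

The file could then be passed to Nextflow with `-params-file params.json` in place of the individual `--input` and `--outdir` flags.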
- FastQC: Comprehensive quality control checks on raw sequence data.
- FastP: All-in-one FASTQ preprocessor (used here for QC metrics).
- SeqFU: Sequence statistics.
- NanoPlot: Plotting tool for long read sequencing data and alignments.
- FastPLong: Quality control for long read data (PacBio).
- Mosdepth: Fast BAM/CRAM depth calculation.
- Samtools Stats: General statistics for alignment files.
- Picard CollectMultipleMetrics: Collects multiple classes of metrics from alignment files.
- RSeQC: Quality control for RNA-seq experiments.
- NGS-Bits SampleGender: Sex determination based on coverage.
- VerifyBamID: A robust tool for estimating DNA contamination from sequence reads using an ancestry-agnostic method.
- BCFTools Stats: Statistics for VCF/BCF files.
- MultiQC: Aggregates results from all tools into a single HTML report.
The GHGA AQuA Nextflow pipeline was originally written by Kubra Narci (@kubranarci).
Current development team:
- Manuel Kösters
- Virag Sharma
- Ruchi Tanavade
If you would like to contribute to this pipeline, please see the contributing guidelines.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
