snakemake derive multiple variables from input file names

Question

I have a problem with deriving variables from input file names - especially if you want to split based on a delimiter. I have tried different approaches (which I can't get to work) and the only one that works so far, fails in the end, because its looking for all possible variations of variables (and hence input files that don't exist).

My problem - the input files are named in the following pattern: 18AR1376_S57_R2_001.fastq.gz

My initial definition of variables at the beginning:
SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz")

but that ends up with my files all being named 18AR1376_S57 subsequently and I'd like to remove _S57 (which refers to the sample sheet id).

An approach that I have found while searching and which works is this:
SAMPLES,SHEETID, = glob_wildcards("../run_links/{sample}_{SHEETID}_R1.001.fastq.gz"}
but it looks for all possible combinations of sample and sheetid, and hence looks for input files that don't exist.

I then tried a more basic python approach:

SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz") ID,=expand([{sample}.split("_")[0]], sample=SAMPLES)``

but that doesn't work at all

and then I tried keeping my original wildcard glob SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz") but define new outputfile names (based on instructions I found in another forum) - but that gives me a syntax error that I can't figure out.

rule trim: input: R1 = "../run_links/{sample}_R1_001.fastq.gz", R2 = "../run_links/{sample}_R2_001.fastq.gz" params: prefix=lambda wildcards, output: output: R1_Pair = ["./output/trimmed/{}_R1_trim_PE.fastq".format({sample}.split("_")[0])], #or the below version R1_Sing = ["./output/trimmed/{}_R1_trim_SE.fastq".format(a) for a in {sample}.split("_")], R2_Pair = ["./output/trimmed/{}_R2_trim_PE.fastq".format(a) for a in {sample}.split("_")], R2_Sing = ["./output/trimmed/{}_R2_trim_SE.fastq".format(a) for a in {sample}.split("_")] resources: cpus=8 log: "../logs/trim_{sample}.log" conda: "envs/trim.yaml" shell: """ trimmomatic PE -trimlog {log} -threads {resources.cpus} {input.R1} {input.R2} {output.R1_Pair} {output.R1_Sing} {output.R2_Pair} {output.R2_Sing} HEADCROP:10 ILLUMINACLIP:scripts/Truseq-Adapter.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50 """

So, two options,

beginning of workflow: split the sample into sample and sample sheet id
define new output names and use split on delimiter _ of {sample}

Does anybody have tips on how to approach this? Thanks

Thanks!

Eric C. · Accepted Answer · 2020-10-12 09:41:26Z

Instead of using glob_wildcards, I would use a simple python dictionary to define your sample names attached to the fastq files:

import os import re d = dict() fastqPath = "." for fastqF in [f for f in os.listdir(fastqPath) if(re.match(r"^[\w-]+_R1_001\.fastq\.gz", f))]: s = re.search(r"(^[\w-]+)_(S\d+)_R1_(001.fastq.gz)", fastqF) samplename = s.group(1) fastqFfile = os.path.join(fastqPath, fastqF) fastqRfile = os.path.join(fastqPath, s.group(1) + "_" + s.group(2) + "_R2_" + s.group(3)) if(os.path.isfile(fastqRfile)): d[samplename] = {"read1":os.path.abspath(fastqFfile),"read2":os.path.abspath(fastqRfile)}

The fastq input files are then really easy to use:

rule all: input: expand("output/trimmed/{sample}_R1_trim_PE.fastq",sample=d) rule trim: input: R1 = lambda wildcards: d[wildcards.sample]["read1"], R2 = lambda wildcards: d[wildcards.sample]["read2"] output: R1_Pair="output/trimmed/{sample}_R1_trim_PE.fastq", R1_Sing="output/trimmed/{sample}_R1_trim_SE.fastq", R2_Pair="output/trimmed/{sample}_R2_trim_PE.fastq", R2_Sing="output/trimmed/{sample}_R2_trim_SE.fastq" resources: cpus=8 log: "../logs/trim_{sample}.log" conda: "envs/trim.yaml" shell: """ trimmomatic PE -trimlog {log} -threads {resources.cpus} {input.R1} {input.R2} {output.R1_Pair} {output.R1_Sing} {output.R2_Pair} {output.R2_Sing} HEADCROP:10 I> """

N.B: I removed the unused params section in your rule trim.

great - thanks a lot for the help! just tried that and it works. the params was a leftover bit of code from a few things that I had tried - good spotting.
@ch_esr don't hesitate to mark the answer as accepted if it helped! thx
ah yes, sorry! I had just upvoted (but it doesn't show, lack of reputation).

Collectives™ on Stack Overflow

snakemake derive multiple variables from input file names

1 Answer 1

3 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Related