I have a problem with deriving variables from input file names - especially if you want to split based on a delimiter. I have tried different approaches (which I can't get to work) and the only one that works so far, fails in the end, because its looking for all possible variations of variables (and hence input files that don't exist).
My problem - the input files are named in the following pattern: 18AR1376_S57_R2_001.fastq.gz
My initial definition of variables at the beginning:
SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz")
but that ends up with my files all being named 18AR1376_S57 subsequently and I'd like to remove _S57 (which refers to the sample sheet id).
An approach that I have found while searching and which works is this:
SAMPLES,SHEETID, = glob_wildcards("../run_links/{sample}_{SHEETID}_R1.001.fastq.gz"}
but it looks for all possible combinations of sample and sheetid, and hence looks for input files that don't exist.
I then tried a more basic python approach:
SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz") ID,=expand([{sample}.split("_")[0]], sample=SAMPLES)`` but that doesn't work at all
and then I tried keeping my original wildcard glob SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz") but define new outputfile names (based on instructions I found in another forum) - but that gives me a syntax error that I can't figure out.
rule trim: input: R1 = "../run_links/{sample}_R1_001.fastq.gz", R2 = "../run_links/{sample}_R2_001.fastq.gz" params: prefix=lambda wildcards, output: output: R1_Pair = ["./output/trimmed/{}_R1_trim_PE.fastq".format({sample}.split("_")[0])], #or the below version R1_Sing = ["./output/trimmed/{}_R1_trim_SE.fastq".format(a) for a in {sample}.split("_")], R2_Pair = ["./output/trimmed/{}_R2_trim_PE.fastq".format(a) for a in {sample}.split("_")], R2_Sing = ["./output/trimmed/{}_R2_trim_SE.fastq".format(a) for a in {sample}.split("_")] resources: cpus=8 log: "../logs/trim_{sample}.log" conda: "envs/trim.yaml" shell: """ trimmomatic PE -trimlog {log} -threads {resources.cpus} {input.R1} {input.R2} {output.R1_Pair} {output.R1_Sing} {output.R2_Pair} {output.R2_Sing} HEADCROP:10 ILLUMINACLIP:scripts/Truseq-Adapter.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50 """ So, two options,
- beginning of workflow: split the sample into sample and sample sheet id
- define new output names and use split on delimiter
_of {sample}
Does anybody have tips on how to approach this? Thanks
Thanks!