snakemake batch creation of output

Question

Generating output based on changed input files in Snakemake is easy:

    rule all:
        input: [f'out_{i}.txt' for i in range(10)]

    rule make_input:
        output: 'in_{i}.txt'
        shell: 'touch {output}'

    rule make_output_parallel:
        input: 'in_{i}.txt'
        output: 'out_{i}.txt'
        shell: 'touch {output}'

In this case, make_output_parallel will only run for the instances where in_{i}.txt has changed.

But suppose the out_{i}.txt files cannot be generated in parallel and I want to generate them in a single step, like this:

    rule make_output_one_step:
        input: [f'in_{i}.txt' for i in range(10)]
        output: [f'out_{i}.txt' for i in range(10)]
        shell: 'touch {output}'

If only one of the in_{i}.txt files has changed, I don't need to regenerate all 10 outputs. How can I adjust make_output_one_step.output to generate only the needed files?
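To make "only the needed files" concrete, the timestamp comparison can be sketched in plain Python. This is an illustration only, not Snakemake's implementation: `stale_indices` and `demo` are hypothetical names, and the sketch assumes an output counts as out of date when it is missing or older than its matching input.

```python
import os
import tempfile
from pathlib import Path


def stale_indices(workdir, n=10):
    """Return the indices i whose out_{i}.txt is missing or older than in_{i}.txt.

    Assumption for this sketch: "changed" means the input has a newer
    modification time than the output.
    """
    workdir = Path(workdir)
    stale = []
    for i in range(n):
        src = workdir / f"in_{i}.txt"
        dst = workdir / f"out_{i}.txt"
        if not dst.exists() or dst.stat().st_mtime < src.stat().st_mtime:
            stale.append(i)
    return stale


def demo():
    """Create 10 input/output pairs, touch one input, and report what is stale."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        for i in range(10):
            for name in (f"in_{i}.txt", f"out_{i}.txt"):
                path = tmp / name
                path.touch()
                os.utime(path, (1_000, 1_000))  # identical, fixed timestamps
        # Simulate a change to a single input by giving it a newer mtime.
        os.utime(tmp / "in_3.txt", (2_000, 2_000))
        return stale_indices(tmp)
```

Calling `demo()` returns `[3]`: only the output whose input was touched is reported, which is the subset a single-step rule would ideally regenerate.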

Neither make_output nor all depend on any in_{i}.txt file. There is no "generating output based on changed input files" in your script.
– Dmitry Kuzminov, Commented Mar 27, 2020 at 22:35
Your intentions are not clear. First, the script in make_output doesn't depend on the input de facto. Next, the make_input produces exactly the same files, so the timestamp de facto is not important. And finally I don't see any reason in your intention "to generate them in a single step". I see a logical fallacy.
– Dmitry Kuzminov, Commented Mar 28, 2020 at 16:14
@DmitryKuzminov, this is meant to be a simple example, not my use case. I do not know what "logical fallacy" you are referring to--could you clarify?
– goi42, Commented Mar 30, 2020 at 20:32

Maarten-vd-Sande · Accepted Answer · 2020-03-31 06:52:15Z

If you want some parts of the pipeline not to run in parallel for whatever reason (RAM, internet usage, IO, API limits, etc.), you can make use of resources.

    rule all:
        input: [f'out_{i}.txt' for i in range(10)]

    rule make_input:
        output: 'in_{i}.txt'
        shell: 'touch {output}'

    rule make_output:
        input: 'in_{i}.txt'
        output: 'out_{i}.txt'
        resources: max_parallel=1
        shell: 'touch {output}'

And then you can call your pipeline like snakemake --resources max_parallel=1 --cores 10. In this case all the jobs of rule make_input can run in parallel, but only one instance of make_output will run at a time.
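As an analogy for what the resource limit does, here is a minimal sketch in plain Python. It is not how Snakemake's scheduler is implemented; `run_jobs` is a hypothetical helper, and a semaphore stands in for the `max_parallel` budget that each job draws one unit from.

```python
import threading
import time


def run_jobs(n_jobs=10, max_parallel=1):
    """Run n_jobs threads, letting at most max_parallel be active at once.

    Returns the highest number of jobs observed running simultaneously.
    """
    slots = threading.Semaphore(max_parallel)  # the shared resource budget
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def job():
        with slots:  # block until one unit of the resource is free
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(0.01)  # stand-in for the real work
            with lock:
                state["active"] -= 1

    threads = [threading.Thread(target=job) for _ in range(n_jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

With `max_parallel=1` all ten jobs still run, but never more than one at a time, which mirrors the behaviour described above: the rule is invoked once per file, just serially.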

Thank you. It sounds like this is an effective way to prevent rules from running in parallel. It would still require make_output being called 10 times instead of in a single step, which is what I would prefer, but perhaps this is the best solution Snakemake can achieve.
