Generating output based on changed input files in Snakemake is easy:

    rule all:
        input: [f'out_{i}.txt' for i in range(10)]

    rule make_input:
        output: 'in_{i}.txt'
        shell: 'touch {output}'

    rule make_output_parallel:
        input: 'in_{i}.txt'
        output: 'out_{i}.txt'
        shell: 'touch {output}'

In this case, make_output_parallel will only run for the instances whose in_{i}.txt has changed.
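To make "changed" concrete, here is a minimal plain-Python sketch of the kind of modification-time comparison that timestamp-based build tools conventionally make. It is only an illustration: the helpers `needs_rerun`, `stale_indices`, and `demo` are hypothetical names, not part of Snakemake, and Snakemake's own rerun logic is more involved than this.

```python
import os
import tempfile


def needs_rerun(input_path: str, output_path: str) -> bool:
    """Return True if the output is missing or older than its input."""
    if not os.path.exists(output_path):
        return True
    return os.path.getmtime(input_path) > os.path.getmtime(output_path)


def stale_indices(workdir: str, n: int = 10) -> list:
    """List the indices i whose out_{i}.txt is out of date w.r.t. in_{i}.txt."""
    return [
        i
        for i in range(n)
        if needs_rerun(
            os.path.join(workdir, f"in_{i}.txt"),
            os.path.join(workdir, f"out_{i}.txt"),
        )
    ]


def demo() -> list:
    """Create 10 up-to-date pairs, then 'touch' in_3.txt so it becomes newer."""
    with tempfile.TemporaryDirectory() as workdir:
        for i in range(10):
            # Inputs get an older timestamp than outputs, so nothing is stale.
            for prefix, mtime in (("in", 1_000), ("out", 2_000)):
                path = os.path.join(workdir, f"{prefix}_{i}.txt")
                open(path, "w").close()
                os.utime(path, (mtime, mtime))
        # Simulate a change to a single input file.
        changed = os.path.join(workdir, "in_3.txt")
        os.utime(changed, (3_000, 3_000))
        return stale_indices(workdir)
```

With one rule instance per index, only the stale index needs to be rebuilt, which is the per-file behaviour described above.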

But suppose the out_{i}.txt files cannot be generated in parallel, and I want to generate them in a single step, like this:

    rule make_output_one_step:
        input: [f'in_{i}.txt' for i in range(10)]
        output: [f'out_{i}.txt' for i in range(10)]
        shell: 'touch {output}'

If only one of the in_{i}.txt files has changed, I don't need to regenerate all 10 outputs. How can I adjust make_output_one_step.output to generate only the needed files?

  • Neither make_output nor all depend on any in_{i}.txt file. There is no "generating output based on changed input files" in your script. Commented Mar 27, 2020 at 22:35
  • Oops you're right. Edited to make more clear. Commented Mar 28, 2020 at 15:53
  • Your intentions are not clear. First, the script in make_output doesn't depend on the input de facto. Next, the make_input produces exactly the same files, so the timestamp de facto is not important. And finally I don't see any reason in your intention "to generate them in a single step". I see a logical fallacy. Commented Mar 28, 2020 at 16:14
  • You could set a max parallel resource. Commented Mar 28, 2020 at 21:59
  • @DmitryKuzminov, this is meant to be a simple example, not my use case. I do not know what "logical fallacy" you are referring to--could you clarify? Commented Mar 30, 2020 at 20:32

1 Answer

If you want some parts of the pipeline not to run in parallel for whatever reason (RAM, internet usage, IO, API limits, etc.), you can make use of resources.

    rule all:
        input: [f'out_{i}.txt' for i in range(10)]

    rule make_input:
        output: 'in_{i}.txt'
        shell: 'touch {output}'

    rule make_output:
        input: 'in_{i}.txt'
        output: 'out_{i}.txt'
        resources: max_parallel=1
        shell: 'touch {output}'

And then you can call your pipeline like snakemake --resources max_parallel=1 --cores 10. In this case all the jobs of rule make_input can run in parallel, but only one instance of make_output will run at a time.
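As a rough analogy for the effect of such a cap (not a description of how Snakemake's scheduler is implemented), the plain-Python sketch below runs 10 jobs on a pool of 10 worker threads while a semaphore lets only `cap` of them execute at once. The function `run_with_cap` and its parameters are hypothetical names chosen for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_cap(n_jobs: int = 10, workers: int = 10, cap: int = 1) -> int:
    """Run n_jobs on `workers` threads, at most `cap` at once.

    Returns the peak number of jobs that were running simultaneously.
    """
    slots = threading.Semaphore(cap)  # plays the role of the resource limit
    lock = threading.Lock()           # protects the counters below
    running = 0
    peak = 0

    def job(_index: int) -> None:
        nonlocal running, peak
        with slots:  # wait until a slot of the limited resource is free
            with lock:
                running += 1
                peak = max(peak, running)
            time.sleep(0.01)  # stand-in for the real work
            with lock:
                running -= 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(job, range(n_jobs)))
    return peak
```

With `cap=1` the jobs execute one after another even though 10 workers are available, which mirrors the combination of `--cores 10` and `max_parallel=1`. Each job is still a separate invocation, so this serializes the work rather than merging it into a single step.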

1 Comment

Thank you. It sounds like this is an effective way to prevent rules from running in parallel. It would still require make_output being called 10 times instead of in a single step, which is what I would prefer, but perhaps this is the best solution Snakemake can achieve.
