
I have limited experience with multithreading, and I'm currently looking at the PyTorch code, where a for loop is parallelized using their custom parallel_for implementation (similar constructs exist in other codebases and in C++ itself):

https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp#L2747

My question is: why is it parallelizing over the number of threads? In most cases where I see a for loop parallelized, it divides the domain (e.g., the indices of an array), but here the loop range is the thread count itself. Is this a standard way of multithreading?
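For reference, the pattern looks roughly like this (my simplified reading of it, not the exact kernel at the link; process_all is just a hypothetical wrapper):

```cpp
#include <ATen/Parallel.h>

void process_all(int64_t n) {
  const int64_t num_threads = at::get_num_threads();
  // The parallel range is [0, num_threads), not [0, n):
  at::parallel_for(0, num_threads, 1, [&](int64_t begin, int64_t end) {
    for (int64_t t = begin; t < end; ++t) {
      // Each "iteration" t then derives its own slice of the real domain.
      const int64_t lo = t * n / num_threads;
      const int64_t hi = (t + 1) * n / num_threads;
      for (int64_t i = lo; i < hi; ++i) {
        // ... process item i ...
      }
    }
  });
}
```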

1 Answer


Say you want a parallel_for loop over 4000 items, and you have 2 CPUs (threads) available. You can choose an arbitrary domain size of 1000. Each thread then needs to process 2 of those domains: you've factored the problem into 2*2*1000.
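A minimal standalone sketch of that factoring, using plain std::thread rather than ATen's parallel_for (process_item is a hypothetical stand-in for the real loop body):

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

void process_item(int64_t) { /* stand-in for the real work */ }

int main() {
  const int64_t n = 4000;        // total items
  const int64_t domain = 1000;   // arbitrary domain size -> 4 domains
  const int num_threads = 2;
  const int64_t num_domains = (n + domain - 1) / domain;

  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([=] {
      // Thread t takes domains t, t + num_threads, ...: 2 domains each here.
      for (int64_t d = t; d < num_domains; d += num_threads) {
        const int64_t begin = d * domain;
        const int64_t end = std::min(begin + domain, n);
        for (int64_t i = begin; i < end; ++i) process_item(i);
      }
    });
  }
  for (auto& w : workers) w.join();
}
```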

If you don't choose an arbitrary domain size, but let the thread count set it, you factor the problem into 2*2000. This is a bit simpler, and there's less per-domain scheduling overhead: each thread gets a single domain.
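The same job with the thread count setting the domain size, so each thread owns exactly one contiguous slice (same assumptions as above; process_item is hypothetical):

```cpp
#include <cstdint>
#include <thread>
#include <vector>

void process_item(int64_t) { /* stand-in for the real work */ }

int main() {
  const int64_t n = 4000;
  const int num_threads = 2;  // domain size becomes n / num_threads = 2000

  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([=] {
      // Thread t owns one contiguous domain: [t*n/T, (t+1)*n/T).
      const int64_t begin = t * n / num_threads;
      const int64_t end = (t + 1) * n / num_threads;
      for (int64_t i = begin; i < end; ++i) process_item(i);
    });
  }
  for (auto& w : workers) w.join();
}
```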


2 Comments

ah I see, that makes sense. How do you decide which approach to use? Is it usually obvious, or do you need to benchmark empirically to see which one might be better for a given use case?
@westcoaststudent: Empirically? You start with the easiest solution, which is a plain old for. Usually that's good enough. The next step is to just choose the parallel_for form that's easiest to use. PyTorch really is exceptional software; most programmers will never write software that is as widely used. There's no point spending one day to make software run 1 second faster, unless that software is run at least thousands of times.
