Timeline for How can I improve the speed of scanning multiple directories recursively at the same time?
Current License: CC BY-SA 4.0
16 events
| when | what | by | license | comment |
|---|---|---|---|---|
| Apr 12, 2020 at 21:40 | comment added | tera_789 | | @Tfry I actually need per-subdirectory sizes, so df won't help. |
| Apr 12, 2020 at 20:04 | comment added | Doc Brown | | @tera_789: I would recommend you edit the details about the storage hardware into your question (in the comments down here, chances are high they will be overlooked). |
| Apr 12, 2020 at 9:53 | comment added | Tfry | | Oh, and you're really interested in per-disk total usage, not per-subdirectory usage, right? In that case, using df (via os.system()) will be a lot faster than calculating directory sizes recursively. |
| Apr 12, 2020 at 9:37 | comment added | Tfry | | Good question. However, to test whether this is actually the relevant figure, try feeding your algorithm with a) tasks that are each on separate physical disks (and separate physical network devices) and b) tasks that are all on one disk/device. Of course, each single task should be roughly the same size, to make the measurements comparable. |
| Apr 12, 2020 at 8:46 | comment added | tera_789 | | Actually, I am not sure which workers, and how many, are assigned to which disks. How could I even check that? |
| Apr 12, 2020 at 8:46 | comment added | tera_789 | | Yes, thousands of physical disks. I knew that a networking 'issue' could be there, but was hopeful that ThreadPoolExecutor would help. |
| Apr 12, 2020 at 8:43 | comment added | Tfry | | Also, are you making sure that each (physical) disk has a single worker assigned to it? Two workers on the same disk will not be able to get much done in parallel. Good/bad luck in the distribution of your workers might explain some of the fluctuation in your timings. |
| Apr 12, 2020 at 8:42 | comment added | Tfry | | Phew. I do think you'll have to give quite a few more details about your setup. Obviously it involves networking, which will indeed introduce a whole new layer to look at (and probably one much more important than the details of parallelism on the controlling machine). Further: tens of thousands of disks? Are we talking about physical disks there? Because that is what will matter. |
| Apr 12, 2020 at 8:32 | comment added | tera_789 | | Fluctuations in run-time duration do seem weird even to me, which makes me think of QoS etc., because the machine was completely free from other tasks. |
| Apr 12, 2020 at 8:31 | comment added | tera_789 | | It is sort of hard to tell exactly how many disks, because the pool contains tens of thousands of disks and the data blocks may be spread across them. Also, there could be network limitations based on the load, and there are QoS policies on each storage device. On top of that, each storage device is intelligent enough to give certain priority to different processes... In terms of network, this is where I expected ThreadPoolExecutor to excel more than the others. |
| Apr 12, 2020 at 8:25 | comment added | Tfry | | How many disks? That will be much more important to know than how many CPUs. I would expect optimum results with one thread per disk (and making sure that each thread works on exactly one disk). Also, your timings show a huge amount of fluctuation; I do wonder why. Is this network-bound? Are other tasks running on this system during your benchmarking? |
| Apr 12, 2020 at 8:23 | comment added | tera_789 | | And the I/O workload is actually already split across different HDs and storage devices. |
| Apr 12, 2020 at 8:19 | comment added | tera_789 | | I reduced the number of workers to 12 (previously it was 24) but nothing really changed... I also posted test results if you wanna take a look... |
| Apr 12, 2020 at 7:07 | history edited | Tfry | CC BY-SA 4.0 | added 232 characters in body |
| Apr 12, 2020 at 7:05 | review: First posts | | | completed Apr 12, 2020 at 11:29 |
| Apr 12, 2020 at 7:01 | history answered | Tfry | CC BY-SA 4.0 | |
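
The recurring advice in these comments (one worker per physical disk, and recursive sizing rather than df, since per-subdirectory sizes are needed) can be illustrated with a minimal Python sketch. The mount points, the dir_size helper, and the use of os.scandir are all assumptions made for illustration; the question's actual code is not shown in this timeline.

```python
# A minimal sketch, assuming directory trees are laid out one per physical
# disk (the /mnt/diskN paths are hypothetical) so each worker touches
# exactly one disk, as Tfry suggests in the comments above.
import os
from concurrent.futures import ThreadPoolExecutor

def dir_size(path):
    """Recursively sum the sizes of all regular files under path."""
    total = 0
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_file(follow_symlinks=False):
                    total += entry.stat(follow_symlinks=False).st_size
                elif entry.is_dir(follow_symlinks=False):
                    total += dir_size(entry.path)
    except PermissionError:
        pass  # skip directories we cannot read
    return total

# Hypothetical mount points: one directory tree per physical disk.
disks = ["/mnt/disk0", "/mnt/disk1", "/mnt/disk2"]

# One worker per disk, so no two threads compete for the same device.
with ThreadPoolExecutor(max_workers=len(disks)) as pool:
    sizes = dict(zip(disks, pool.map(dir_size, disks)))

for disk, size in sizes.items():
    print(f"{disk}: {size / 1024**3:.2f} GiB")
```

With networked storage behind QoS policies, as described in the comments, the best worker count may not equal the disk count; the a)/b) benchmark Tfry proposes is the reliable way to find it.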