16 events
when | toggle format | what | by | license | comment
Apr 12, 2020 at 21:40 comment added tera_789 @Tfry I actually need sub-directory sizes, so df won't help.
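[Editor's note: not from the original thread, but a minimal sketch of the recursive sub-directory size calculation being discussed, using os.scandir; the function name and paths are placeholders.]

    import os

    def dir_size(path):
        """Recursively sum file sizes under path using os.scandir."""
        total = 0
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_file(follow_symlinks=False):
                    total += entry.stat(follow_symlinks=False).st_size
                elif entry.is_dir(follow_symlinks=False):
                    total += dir_size(entry.path)
        return total

Passing follow_symlinks=False avoids following links off the directory tree and double counting.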
Apr 12, 2020 at 20:04 comment added Doc Brown @tera_789: I would recommend you edit the details about the storage hardware into your question (down here in the comments, chances are high they will be overlooked).
Apr 12, 2020 at 9:53 comment added Tfry Oh, and you're really interested in per-disk total usage, not per-subdirectory usage, right? In that case, using df (via os.system()) will be a lot faster than calculating directory size recursively.
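[Editor's note: a sketch of the df idea mentioned above, using subprocess instead of os.system so the output can be captured; the mount point is a placeholder and POSIX -P output is assumed.]

    import subprocess

    def disk_usage_bytes(mount_point):
        """Ask df for used bytes on the filesystem backing mount_point."""
        out = subprocess.run(
            ["df", "-P", "-k", mount_point],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        # Second line of -P -k output: filesystem, 1024-blocks, used, available, capacity, mount
        used_kib = int(out[1].split()[2])
        return used_kib * 1024

If only the filesystem totals are needed, shutil.disk_usage(mount_point) returns (total, used, free) without spawning a process at all.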
Apr 12, 2020 at 9:37 comment added Tfry Good question. However, to test whether this is actually the relevant figure, try feeding your algorithm with a) tasks that are each on separate physical disks (and separate physical network devices) and b) tasks that are all on one disk / device. Of course, each single task should be roughly the same size, to make the measurements comparable.
Apr 12, 2020 at 8:46 comment added tera_789 Actually, I am not sure which workers, or how many, are assigned to which disks. How could I even check that?
Apr 12, 2020 at 8:46 comment added tera_789 Yes, thousands of physical disks. I knew a networking 'issue' could be there, but I was hopeful that ThreadPoolExecutor would help.
Apr 12, 2020 at 8:43 comment added Tfry Also, are you making sure that each (physical) disk has a single worker assigned to it? Two workers on the same disk will not be able to actually get much done in parallel. Good/bad luck in the distribution of your workers might explain some of the fluctuation in your timings.
Apr 12, 2020 at 8:42 comment added Tfry Phew. I do think you'll have to give quite a few more details about your setup. Obviously it involves networking, which will indeed introduce a whole new layer to look at (and probably a much more important one than the details of parallelism on the controlling machine). Further: tens of thousands of disks? Are we talking about physical disks there? Because that is what will matter.
Apr 12, 2020 at 8:32 comment added tera_789 Fluctuations in run-time duration do seem weird even to me, which makes me think of QOS etc., because the machine was completely free of other tasks.
Apr 12, 2020 at 8:31 comment added tera_789 It is sort of hard to tell exactly how many disks, because the pool of disks is tens of thousands and the data blocks may be spread across those. Also, there could be network limitations based on the load, and there are QOS policies on each storage device. On top of that, each storage device is intelligent enough to give a certain priority to different processes... In terms of network, this is where I expected ThreadPoolExecutor to excel more than other approaches.
Apr 12, 2020 at 8:25 comment added Tfry How many disks? That will be much more important to know than how many CPUs. I would expect optimum results with one thread per disk (and making sure that each thread works on exactly one disk). Also, your timings show a huge amount of fluctuation. I do wonder why. Is this network bound? Are other tasks running on this system during your benchmarking?
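[Editor's note: not from the original thread, but a sketch of the one-worker-per-disk idea above, assuming each top-level path's st_dev identifies the disk it lives on (which may not hold for pooled or network storage, as discussed in later comments); dir_size is the hypothetical helper sketched earlier.]

    import os
    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor

    def group_by_device(paths):
        """Bucket paths by the st_dev of the filesystem they live on."""
        groups = defaultdict(list)
        for p in paths:
            groups[os.stat(p).st_dev].append(p)
        return groups

    def size_all(paths, dir_size):
        """One worker per device; each worker walks only its own device's paths."""
        groups = group_by_device(paths)
        def work(batch):
            return sum(dir_size(p) for p in batch)
        with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
            return dict(zip(groups, pool.map(work, groups.values())))

The point of the grouping is that two workers walking the same spindle mostly serialize on seeks, so extra workers per disk add little.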
Apr 12, 2020 at 8:23 comment added tera_789 And the I/O workload is actually already split across different HDs and storage devices.
Apr 12, 2020 at 8:19 comment added tera_789 I reduced the number of workers to 12 (previously it was 24), but nothing really changed... I also posted test results if you want to take a look.
Apr 12, 2020 at 7:07 history edited Tfry CC BY-SA 4.0
added 232 characters in body
Apr 12, 2020 at 7:05 review First posts
Apr 12, 2020 at 11:29
Apr 12, 2020 at 7:01 history answered Tfry CC BY-SA 4.0