Timeline for How can I improve the speed of scanning multiple directories recursively at the same time?
Current License: CC BY-SA 4.0
16 events
| when | what | by | license | comment |
|---|---|---|---|---|
| Apr 12, 2020 at 21:40 | comment added | tera_789 | | @Tfry I actually need per-subdirectory sizes, so df won't help. |
| Apr 12, 2020 at 20:04 | comment added | Doc Brown | | @tera_789: I would recommend you edit the details about the storage hardware into your question (in the comments down here, chances are high they will be overlooked). |
| Apr 12, 2020 at 9:53 | comment added | Tfry | | Oh, and you're really interested in per-disk total usage, not per-subdirectory usage, right? In that case, using df (via os.system()) will be a lot faster than calculating directory sizes recursively. |
| Apr 12, 2020 at 9:37 | comment added | Tfry | | Good question. However, to test whether this is actually the relevant figure, try feeding your algorithm with a) tasks that are each on separate physical disks (and separate physical network devices) and b) tasks that are all on one disk/device. Of course, each single task should be roughly the same size, to make the measurements comparable. |
| Apr 12, 2020 at 8:46 | comment added | tera_789 | | Actually, I am not sure which workers, and how many, are assigned to which disks. How could I even check that? |
| Apr 12, 2020 at 8:46 | comment added | tera_789 | | Yes, thousands of physical disks. I knew that a networking 'issue' could be there, but was hopeful that ThreadPoolExecutor would help. |
| Apr 12, 2020 at 8:43 | comment added | Tfry | | Also, are you making sure that each (physical) disk has a single worker assigned to it? Two workers on the same disk will not be able to get much done in parallel. Good/bad luck in the distribution of your workers might explain some of the fluctuation in your timings. |
| Apr 12, 2020 at 8:42 | comment added | Tfry | | Phew. I do think you'll have to give quite a few more details about your setup. Obviously it involves networking, which will indeed introduce a whole new layer to look at (and probably one much more important than the details of parallelism on the controlling machine). Further: tens of thousands of disks? Are we talking about physical disks there? Because that is what will matter. |
| Apr 12, 2020 at 8:32 | comment added | tera_789 | | Fluctuations in run-time duration do seem weird even to me, which makes me think of QoS etc., because the machine was completely free from other tasks. |
| Apr 12, 2020 at 8:31 | comment added | tera_789 | | It is sort of hard to tell exactly how many disks, because the pool contains tens of thousands of disks and the data blocks may be spread across them. Also, there could be network limitations based on the load, and there are QoS policies on each storage device. On top of that, each storage device is intelligent enough to give certain priority to different processes... In terms of network, this is where I expected ThreadPoolExecutor to excel more than the others. |
| Apr 12, 2020 at 8:25 | comment added | Tfry | | How many disks? That will be much more important to know than how many CPUs. I would expect optimum results with one thread per disk (and making sure that each thread works on exactly one disk). Also, your timings show a huge amount of fluctuation; I do wonder why. Is this network-bound? Are other tasks running on this system during your benchmarking? |
| Apr 12, 2020 at 8:23 | comment added | tera_789 | | And the I/O workload is actually already split across different HDs and storage devices. |
| Apr 12, 2020 at 8:19 | comment added | tera_789 | | I reduced the number of workers to 12 (previously it was 24) but nothing really changed... I also posted test results if you wanna take a look... |
| Apr 12, 2020 at 7:07 | history edited | Tfry | CC BY-SA 4.0 | added 232 characters in body |
| Apr 12, 2020 at 7:05 | review: First posts | | | completed Apr 12, 2020 at 11:29 |
| Apr 12, 2020 at 7:01 | history answered | Tfry | CC BY-SA 4.0 | |
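
The recurring advice in these comments (one worker per physical disk, and recursive sizing rather than df, since per-subdirectory sizes are needed) can be illustrated with a minimal Python sketch. The mount points, the dir_size helper, and the use of os.scandir are all assumptions made for illustration; the question's actual code is not shown in this timeline.

```python
# A minimal sketch, assuming directory trees are laid out one per physical
# disk (the /mnt/diskN paths are hypothetical) so each worker touches
# exactly one disk, as Tfry suggests in the comments above.
import os
from concurrent.futures import ThreadPoolExecutor

def dir_size(path):
    """Recursively sum the sizes of all regular files under path."""
    total = 0
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_file(follow_symlinks=False):
                    total += entry.stat(follow_symlinks=False).st_size
                elif entry.is_dir(follow_symlinks=False):
                    total += dir_size(entry.path)
    except PermissionError:
        pass  # skip directories we cannot read
    return total

# Hypothetical mount points: one directory tree per physical disk.
disks = ["/mnt/disk0", "/mnt/disk1", "/mnt/disk2"]

# One worker per disk, so no two threads compete for the same device.
with ThreadPoolExecutor(max_workers=len(disks)) as pool:
    sizes = dict(zip(disks, pool.map(dir_size, disks)))

for disk, size in sizes.items():
    print(f"{disk}: {size / 1024**3:.2f} GiB")
```

With networked storage behind QoS policies, as described in the comments, the best worker count may not equal the disk count; the a)/b) benchmark Tfry proposes is the reliable way to find it.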