
I have edited the code, and currently it is working, but it seems it is not executing in parallel. Can anyone please check it?

Code:

```python
import time
from functools import partial
from multiprocessing import Pool, freeze_support

def folderStatistic(t):
    j, dir_name = t
    row = []
    for content in dir_name.split(","):
        row.append(content)
    print(row)

def get_directories():
    import csv
    with open('CONFIG.csv', 'r') as file:
        reader = csv.reader(file, delimiter='\t')
        return [col for row in reader for col in row]

def folderstatsMain():
    freeze_support()
    start = time.time()
    pool = Pool()
    worker = partial(folderStatistic)
    pool.map(worker, enumerate(get_directories()))

def datatobechecked():
    try:
        folderstatsMain()
    except Exception as e:
        # pass
        print(e)

if __name__ == '__main__':
    datatobechecked()
```

Config.CSV

C:\USERS, .CSV
C:\WINDOWS , .PDF
etc.

There may be around 200 folder paths in CONFIG.csv.

  • Can you please explain what your snippet, and especially the folderStatistic function, is meant to do? I get the feeling that the problem is in the implementation of this method. Have you tried running it in a single-threaded fashion? Commented Jun 2, 2021 at 14:34
    worker = partial(folderStatistic) is not accomplishing anything of value; you might as well have pool.map(folderStatistic, enumerate(get_directories())). Commented Jun 2, 2021 at 15:37

2 Answers


Welcome to Stack Overflow and the Python programming world!

Moving on to the question. Inside the get_directories() function you open the file in a with context and get the reader object, but the file is closed the moment you leave that context, so if the reader object is only consumed later, the file is already closed by then.
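A minimal sketch of the failure mode described above (the broken/fixed function names are hypothetical; the fix is simply to consume the reader before leaving the with block):

```python
import csv

# Broken pattern: the reader is returned, but the file is closed
# as soon as the `with` block exits, so iterating it later fails
# with "I/O operation on closed file".
def get_reader_broken():
    with open('CONFIG.csv', 'r', newline='') as file:
        reader = csv.reader(file, delimiter='\t')
    return reader  # file is already closed here

# Fix: fully consume the reader while the file is still open.
def get_directories():
    with open('CONFIG.csv', 'r', newline='') as file:
        reader = csv.reader(file, delimiter='\t')
        return [col for row in reader for col in row]
```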

I don't want to discourage you, but if you are very new to programming, do not dive into parallel programming yet. The difficulty of handling multiple threads simultaneously grows quickly with every thread you add (although pools greatly simplify the process). Processes are even worse, as they don't share memory and can't communicate with each other easily.

My advice is: try to write it as a single-threaded program first. If you have it working and still need to parallelize it, isolate a single function that takes an input file path as a parameter and does all the work, then use a thread/process pool on that function.
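That advice can be sketched roughly like this, with a hypothetical folder_statistic standing in for the real per-directory work:

```python
import time
from multiprocessing import Pool

# Hypothetical stand-in for the real per-directory work:
# one function, one path in, one result out.
def folder_statistic(path):
    time.sleep(0.01)          # placeholder for the actual work
    return path.strip()

def run_single(paths):
    # Step 1: a plain loop -- debug the logic here first.
    return [folder_statistic(p) for p in paths]

def run_parallel(paths):
    # Step 2: once the loop works and is genuinely too slow,
    # the very same function is mapped over a process pool.
    with Pool() as pool:
        return pool.map(folder_statistic, paths)
```

The point of this shape is that nothing changes between steps 1 and 2 except the call site, which makes it easy to compare timings and correctness between the two versions.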

EDIT: From what I can understand from your code, you get directory names from the CSV file and then run folderStatistic in parallel for each "cell" in the file. This part seems correct. The problem may lie in dir_name.split(","); notice that you pass individual "cells" to folderStatistic, not rows. What makes you think it's not running in parallel?


3 Comments

Yeah, I got it, but it seems it is not executing in parallel. I am really curious about this topic. I will update my question; can you please check?
It takes about the same time as running without multiprocessing.
Check which part takes the longest. Keep in mind that you read the CSV in your main process (from your snippet that seems to be the most time-consuming part) and then run folderStatistic in separate processes. That might be why you get the same results. Try putting time.sleep(0.1) into folderStatistic and then compare the timings.

There is a certain amount of overhead in creating a multiprocessing pool because creating processes is, unlike creating threads, a fairly costly operation. Then those submitted tasks, represented by each element of the iterable being passed to the map method, are gathered up in "chunks" and written to a multiprocessing queue of tasks that are read by the pool processes. This data has to move from one address space to another and that has a cost associated with it. Finally when your worker function, folderStatistic, returns its result (which is None in this case), that data has to be moved from one process's address space back to the main process's address space and that too has a cost associated with it.

All of those added costs become worthwhile when your worker function is sufficiently CPU-intensive that they are small compared to the savings gained by having the tasks run in parallel. But your worker function's CPU requirements are too small to reap any benefit from multiprocessing.
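Incidentally, the chunking mentioned earlier can be tuned through Pool.map's chunksize parameter, which amortizes the queue traffic over fewer, larger messages. A small sketch (square is just a toy worker):

```python
from multiprocessing import Pool

def square(x):
    return x * x

def run(n):
    with Pool(4) as pool:
        # With chunksize=50, the n inputs are written to the task
        # queue in batches of 50 instead of n individual messages,
        # reducing inter-process traffic (though it cannot remove
        # the fixed cost of creating the pool itself).
        return pool.map(square, range(n), chunksize=50)
```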

Here is a demo comparing single-processing time vs. multiprocessing time for invoking a worker function, fn, twice: the first time it performs its internal loop only 10 times (low CPU requirements), while the second time it performs the loop 1,000,000 times (higher CPU requirements). You can see that in the first case the multiprocessing version runs considerably slower (you can't even measure the time for the single-processing run). But when we make fn more CPU-intensive, multiprocessing achieves gains over the single-processing case.

```python
from multiprocessing import Pool
from functools import partial
import time

def fn(iterations, x):
    the_sum = x
    for _ in range(iterations):
        the_sum += x
    return the_sum

# required for Windows:
if __name__ == '__main__':
    for n_iterations in (10, 1_000_000):
        # single processing time:
        t1 = time.time()
        for x in range(1, 20):
            fn(n_iterations, x)
        t2 = time.time()
        # multiprocessing time:
        worker = partial(fn, n_iterations)
        t3 = time.time()
        with Pool() as p:
            results = p.map(worker, range(1, 20))
        t4 = time.time()
        print(f'#iterations = {n_iterations}, single processing time = {t2 - t1}, multiprocessing time = {t4 - t3}')
```

Prints:

#iterations = 10, single processing time = 0.0, multiprocessing time = 0.35399389266967773
#iterations = 1000000, single processing time = 1.182999849319458, multiprocessing time = 0.5530076026916504

But even with a pool size of 8, the running time is not reduced by a factor of 8 (it's more like a factor of 2) due to the fixed multiprocessing overhead. When I change the number of iterations for the second case to be 100,000,000 (even more CPU-intensive), we get ...

#iterations = 100000000, single processing time = 109.3077495098114, multiprocessing time = 27.202054023742676 

... which is a reduction in running time by a factor of 4 (I have many other processes running on my computer, so there is competition for the CPU).

