I used this and this to run 2 function calls in parallel, but the times are barely improving. This is my code:
Sequential:
    from nltk import pos_tag

    def posify(txt):
        return ' '.join([pair[1] for pair in pos_tag(txt.split())])

    df1['pos'] = df1['txt'].apply(posify)  # ~15 seconds
    df2['pos'] = df2['txt'].apply(posify)  # ~15 seconds

    # Total Time: 30 seconds

Parallel:
    from nltk import pos_tag
    import multiprocessing

    def posify(txt):
        return ' '.join([pair[1] for pair in pos_tag(txt.split())])

    def posify_parallel(ser, key_name, shared_dict):
        shared_dict[key_name] = ser.apply(posify)

    manager = multiprocessing.Manager()
    return_dict = manager.dict()

    p1 = multiprocessing.Process(target=posify_parallel, args=(df1['txt'], 'df1', return_dict))
    p1.start()
    p2 = multiprocessing.Process(target=posify_parallel, args=(df2['txt'], 'df2', return_dict))
    p2.start()
    p1.join()
    p2.join()

    df1['pos'] = return_dict['df1']
    df2['pos'] = return_dict['df2']

    # Total Time: 27 seconds

I would expect the total time to be about 15 seconds, but I'm getting 27 seconds.
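To see where the extra ~12 seconds go, here is an instrumented variant of the worker I could use (just a sketch; posify_parallel_timed is a made-up name) that separates the tagging itself from the write-back through the Manager:

    import time

    def posify_parallel_timed(ser, key_name, shared_dict):
        # Hypothetical instrumentation: split the time spent tagging from the
        # time spent shipping the result back through the Manager, which
        # pickles the whole Series and sends it to the manager process.
        t0 = time.perf_counter()
        result = ser.apply(posify)
        t1 = time.perf_counter()
        shared_dict[key_name] = result
        t2 = time.perf_counter()
        print(f'{key_name}: apply {t1 - t0:.1f}s, manager write-back {t2 - t1:.1f}s')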
If it makes any difference, I have an i7 2.6GHz CPU with 6 cores (12 logical).
Is it possible to achieve something around 15 seconds? Does this have something to do with the pos_tag function itself?
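One quick check along those lines (my assumption being that pos_tag may load its tagger model lazily on first use, so every freshly spawned process would pay that cost again):

    import time
    from nltk import pos_tag

    # Time the first call against a later call in a fresh interpreter; a big
    # gap would suggest pos_tag loads its tagger model lazily on first use.
    t0 = time.perf_counter()
    pos_tag('the quick brown fox'.split())
    t1 = time.perf_counter()
    pos_tag('jumps over the lazy dog'.split())
    t2 = time.perf_counter()
    print(f'first call: {t1 - t0:.2f}s, second call: {t2 - t1:.2f}s')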
EDIT:
I ended up just doing the following and now it's 15 seconds:
    from multiprocessing import Pool, cpu_count

    with Pool(cpu_count()) as pool:
        df1['pos'] = pool.map(posify, df1['txt'])
        df2['pos'] = pool.map(posify, df2['txt'])

I think this way the two pool.map calls run one after the other, but each call is parallelized internally across the pool's workers. As long as it's 15 seconds, that's fine with me.
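If the two frames ever needed to overlap as well, I suppose the same pool could take both jobs at once with map_async (an untested sketch, reusing posify from above):

    from multiprocessing import Pool, cpu_count

    # Untested variant: submit both frames to the same pool up front so their
    # chunks can interleave across the workers, then collect the results.
    with Pool(cpu_count()) as pool:
        r1 = pool.map_async(posify, df1['txt'])
        r2 = pool.map_async(posify, df2['txt'])
        df1['pos'] = r1.get()
        df2['pos'] = r2.get()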