I am trying to find a way to parallelise certain operations on dataframes, especially ones that cannot be vectorised. I have tested the code below, taken from http://www.racketracer.com/2016/07/06/pandas-in-parallel/ , but it doesn't work. There is no error message; the code simply hangs. Debugging it, execution gets stuck at df = pd.concat(pool.map(func, df_split)), again with no error.
What am I doing wrong?
import pandas as pd
import numpy as np
import seaborn as sns
import multiprocessing
from multiprocessing import Pool

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

def multiply_columns(data):
    data['length_of_word'] = data['species'].apply(lambda x: len(x))
    return data

num_partitions = 2  # number of partitions to split dataframe
num_cores = 2  # multiprocessing.cpu_count()  # number of cores on your machine

iris = pd.DataFrame(sns.load_dataset('iris'))
iris = parallelize_dataframe(iris, multiply_columns)
dask?