I have some code that creates a generator with read_sql() and loops through the generator to print each chunk:
execute.py
```python
import pandas as pd
from sqlalchemy import event, create_engine

engine = create_engine('path-to-driver')

def getDistance(chunk):
    print(chunk)
    print(type(chunk))

df_chunks = pd.read_sql("select top 2 * from SCHEMA.table_name", engine, chunksize=1)

for chunk in df_chunks:
    result = getDistance(chunk)
```

It works, and each chunk is printed as a DataFrame. When I attempt to do the same thing with multiprocessing like this...
outside_function.py
```python
def getDistance(chunk):
    print(chunk)
    print(type(chunk))
    df = chunk
    return df
```

execute.py
```python
import pandas as pd
from sqlalchemy import event, create_engine
from multiprocessing import Pool
from outside_function import getDistance

engine = create_engine('path-to-driver')

df_chunks = pd.read_sql("select top 2 * from SCHEMA.table_name", engine, chunksize=1)

if __name__ == '__main__':
    global result
    p = Pool(20)
    for chunk in df_chunks:
        print(chunk)
        result = p.map(getDistance, chunk)
    p.terminate()
    p.join()
```

...the chunks print as column names in the console with the type 'str'. Printing out result reveals this: ['column_name'].
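The behavior can be reproduced without a database. `Pool.map(getDistance, chunk)` iterates over `chunk`, and iterating over a DataFrame yields its column labels as strings, not its rows. A minimal sketch (hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({'col_a': [1], 'col_b': [2]})

# Iterating a DataFrame yields the column labels, not the rows
items = [x for x in df]
print(items)           # ['col_a', 'col_b']
print(type(items[0]))  # <class 'str'>
```

This is exactly what `Pool.map` does to its second argument, so each worker receives one column name instead of a DataFrame.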
Why are the chunks turning into strings that are just the column names when multiprocessing is applied?
Comments:

- Use `df.itertuples()` or `df.iterrows()`. What exactly are you trying to do?
- With `for a in df:`, `a` is a string (a column name). So no matter what you do with it after that, `a` is a string. It doesn't matter whether you're using multiprocessing or not.
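Following the comments above, a minimal sketch of the fix, assuming `getDistance` should receive one DataFrame chunk per call: pass the sequence of chunks to `Pool.map`, rather than a single chunk (the one-row chunks here stand in for `pd.read_sql(..., chunksize=1)`):

```python
from multiprocessing import Pool

import pandas as pd

def getDistance(chunk):
    # Each worker now receives a whole DataFrame chunk
    return len(chunk)

if __name__ == '__main__':
    # Stand-in for pd.read_sql(..., chunksize=1): a list of one-row chunks
    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    df_chunks = [df.iloc[[i]] for i in range(len(df))]

    with Pool(2) as p:
        # Map over the chunks themselves, not over the columns of one chunk
        result = p.map(getDistance, df_chunks)
    print(result)  # [1, 1]
```

Each element handed to a worker is now a DataFrame, so `type(chunk)` inside `getDistance` prints a DataFrame type rather than `str`.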