
I created a two-column pandas df with np.random.randint and used it to generate a second two-column dataframe by applying groupby operations. df.col1 is a series of lists and df.col2 a series of integers; the elements inside the lists are of type 'numpy.int64', and the same holds for the elements of the second column, as a result of random.randint.

    df.a  df.b                         df.col1      df.col2
    3     7                            [1,2,3...]   1
    5     2     -- groupby             [2,5,6...]   2
    1     8        operations -->      [6,4,...]    3
    ...                                ...
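For reference, a minimal sketch of how two such frames can be produced (the column names match the sketch above, but the exact aggregation is an assumption, not the original code):

    import numpy as np
    import pandas as pd

    # original two-column frame of random ints (elements come out as numpy.int64)
    np_list = np.random.randint(0, 2500, size=(10000, 2))
    df0 = pd.DataFrame(np_list, columns=list('ab'))

    # one possible groupby that collects values into lists per group
    df = (df0.groupby('b')['a'].apply(list)
             .reset_index()
             .rename(columns={'a': 'col1', 'b': 'col2'})[['col1', 'col2']])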

When I try to create the pyspark.sql dataframe with spark.createDataFrame(df), I get this error: TypeError: not supported type: type 'numpy.int64'.

Going back to the df generation, I tried different methods to convert the elements from numpy.int64 to python int, but none of them worked:

    np_list = np.random.randint(0, 2500, size=(10000, 2)).astype(IntegerType)
    df = pd.DataFrame(np_list, columns=list('ab'), dtype='int')

I also tried to map with lambda x: int(x) or x.item(), but the type still remains 'numpy.int64'.
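A quick diagnostic shows where the numpy types actually live (a sketch, assuming the grouped frame above):

    print(type(df['col1'].iloc[0][0]))  # <class 'numpy.int64'> -- element inside a list
    print(type(df['col2'].iloc[0]))     # <class 'numpy.int64'> -- pandas keeps plain
                                        # integer columns in numpy arrays, so int(x)
                                        # is converted back on assignment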

According to the pyspark.sql documentation, it should be possible to load a pandas dataframe, but it seems incompatible when the dataframe contains numpy values. Any hints?

Thanks!

2 Answers


Well, the way you do it doesn't work. If you have something like this, you will get the error because of the first column: Spark doesn't understand a list holding the type numpy.int64.

    df.col1      df.col2
    [1,2,3...]   1
    [2,5,6...]   2
    [6,4,...]    3
    ...

If you instead have something like this, it should be okay:

    df.a  df.b
    3     7
    5     2
    1     8

In terms of your code, try this:

    np_list = np.random.randint(0, 2500, size=(10000, 2))
    df = pd.DataFrame(np_list, columns=list('ab'))
    spark_df = spark.createDataFrame(df)

You don't really need to cast this as int again, and if you want to do it explicitly, it is array.astype(int). Then just call spark_df.head(). This should work!
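If you do need the grouped frame with the list column (the asker's first dataframe), converting the elements inside each list to plain Python ints is usually enough, since an object column of lists keeps the converted values as-is. A minimal sketch, assuming the question's column names:

    # turn the numpy.int64 elements inside each list into plain Python ints
    df['col1'] = df['col1'].apply(lambda lst: [int(v) for v in lst])

    # the plain integer column is fine: Spark converts pandas numeric dtypes
    # directly, just like df.a/df.b above
    spark_df = spark.createDataFrame(df)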


2 Comments

Thanks @DatTran, that works. However, the df I want to convert is the first one (with columns col1 and col2 and the lists), not the second one; that's why I tried to force the conversion from 'numpy.int64' to 'python int' in the other dataframe.
@csbr again here... you need to accept the answers people provided that solve your issue.

This is far from being a perfect solution, but this is what I actually run in production to get results:

    from typing import Generator

    import pandas as pd

    for col_name in ['integer column', 'other int column']:
        df3[col_name] = pd.to_numeric(df3[col_name], downcast='integer').astype('Int64')

    def df_generator(df_in: pd.DataFrame) -> Generator[list, None, None]:
        # As PySpark won't accept numpy.Int64, do the stupid thing and iterate the
        # entire dataframe to do any type conversion by ourselves.
        for row_idx, row in df_in.iterrows():
            row_out = []
            for data in row:
                if isinstance(data, str):
                    row_out.append(data)
                elif data is None or data is pd.NA:
                    row_out.append(None)
                else:
                    row_out.append(int(data))
            yield row_out

    spark_df = spark_session.createDataFrame(df_generator(df3), schema=schema)

First I force the known integer columns of the imported data from strings into numbers, and then force them again into the nullable Int64 dtype.

Then the entire Pandas dataframe is converted into a PySpark dataframe. A simple generator function iterates over the entire Pandas dataframe (which is both discouraged and stupid) and yields the exact same data as lists with the proper Python types.
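The schema variable is not shown above; a minimal sketch of what it could look like for these columns (field names and types are assumptions, not the answerer's code):

    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    # hypothetical schema matching df3; adjust field names and types to your data
    schema = StructType([
        StructField('integer column', IntegerType(), True),
        StructField('other int column', IntegerType(), True),
        StructField('text column', StringType(), True),  # assumed string column
    ])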

I wish PySpark team would address this shortcoming to simplify working with Pandas-sourced data.

