
I created a two-column pandas df with np.random.randint and used it to generate a second two-column dataframe by applying groupby operations. df.col1 is a series of lists and df.col2 a series of integers; the elements inside the lists are of type 'numpy.int64', and the same holds for the elements of the second column, as a result of random.randint.

    df.a  df.b                         df.col1      df.col2
    3     7                            [1,2,3...]   1
    5     2     -- groupby             [2,5,6...]   2
    1     8        operations -->      [6,4,...]    3
    ...                                ...
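For reference, a minimal sketch of how two such frames can be produced (the column names match the sketch above, but the exact aggregation is an assumption, not the original code):

    import numpy as np
    import pandas as pd

    # original two-column frame of random ints (elements come out as numpy.int64)
    np_list = np.random.randint(0, 2500, size=(10000, 2))
    df0 = pd.DataFrame(np_list, columns=list('ab'))

    # one possible groupby that collects values into lists per group
    df = (df0.groupby('b')['a'].apply(list)
             .reset_index()
             .rename(columns={'a': 'col1', 'b': 'col2'})[['col1', 'col2']])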

When I try to create the pyspark.sql dataframe with spark.createDataFrame(df), I get this error: TypeError: not supported type: type 'numpy.int64'.

Going back to the df generation, I tried different methods to convert the elements from numpy.int64 to python int, but none of them worked:

    np_list = np.random.randint(0, 2500, size=(10000, 2)).astype(IntegerType)
    df = pd.DataFrame(np_list, columns=list('ab'), dtype='int')

I also tried to map with lambda x: int(x) or x.item(), but the type still remains 'numpy.int64'.
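A quick diagnostic shows where the numpy types actually live (a sketch, assuming the grouped frame above):

    print(type(df['col1'].iloc[0][0]))  # <class 'numpy.int64'> -- element inside a list
    print(type(df['col2'].iloc[0]))     # <class 'numpy.int64'> -- pandas keeps plain
                                        # integer columns in numpy arrays, so int(x)
                                        # is converted back on assignment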

According to the pyspark.sql documentation, it should be possible to load a pandas dataframe, but it seems incompatible when the dataframe contains numpy values. Any hints?

Thanks!

2 Answers


Well, the way you do it doesn't work. If you have something like this, you will get the error because of the first column: Spark doesn't understand a list holding the type numpy.int64.

    df.col1      df.col2
    [1,2,3...]   1
    [2,5,6...]   2
    [6,4,...]    3
    ...

If you instead have something like this, it should be okay:

    df.a  df.b
    3     7
    5     2
    1     8

In terms of your code, try this:

    np_list = np.random.randint(0, 2500, size=(10000, 2))
    df = pd.DataFrame(np_list, columns=list('ab'))
    spark_df = spark.createDataFrame(df)

You don't really need to cast this as int again, and if you want to do it explicitly, it is array.astype(int). Then just call spark_df.head(). This should work!
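If you do need the grouped frame with the list column (the asker's first dataframe), converting the elements inside each list to plain Python ints is usually enough, since an object column of lists keeps the converted values as-is. A minimal sketch, assuming the question's column names:

    # turn the numpy.int64 elements inside each list into plain Python ints
    df['col1'] = df['col1'].apply(lambda lst: [int(v) for v in lst])

    # the plain integer column is fine: Spark converts pandas numeric dtypes
    # directly, just like df.a/df.b above
    spark_df = spark.createDataFrame(df)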


2 Comments

Thanks @DatTran, that works. However, the df I want to convert is the first one (with columns col1 and col2 and the lists), not the second one; that's why I tried to force the conversion from 'numpy.int64' to 'python int' in the other dataframe.
@csbr again here... you need to accept the answers people provided that solve your issue.

This is far from being a perfect solution, but this is what I actually run in production to get results:

    from typing import Generator

    import pandas as pd

    for col_name in ['integer column', 'other int column']:
        df3[col_name] = pd.to_numeric(df3[col_name], downcast='integer').astype('Int64')

    def df_generator(df_in: pd.DataFrame) -> Generator[list, None, None]:
        # As PySpark won't accept numpy.Int64, do the stupid thing and iterate the
        # entire dataframe to do any type conversion by ourselves.
        for row_idx, row in df_in.iterrows():
            row_out = []
            for data in row:
                if isinstance(data, str):
                    row_out.append(data)
                elif data is None or data is pd.NA:
                    row_out.append(None)
                else:
                    row_out.append(int(data))
            yield row_out

    spark_df = spark_session.createDataFrame(df_generator(df3), schema=schema)

First I force the known integer columns of the imported data from strings into numbers, and then force them again into the nullable Int64 dtype.

Then the entire Pandas dataframe is converted into a PySpark dataframe. A simple generator function iterates over the entire Pandas dataframe (which is both discouraged and stupid) and yields the exact same data as lists with the proper Python types.
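The schema variable is not shown above; a minimal sketch of what it could look like for these columns (field names and types are assumptions, not the answerer's code):

    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    # hypothetical schema matching df3; adjust field names and types to your data
    schema = StructType([
        StructField('integer column', IntegerType(), True),
        StructField('other int column', IntegerType(), True),
        StructField('text column', StringType(), True),  # assumed string column
    ])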

I wish PySpark team would address this shortcoming to simplify working with Pandas-sourced data.

