First, I tried everything in the question linked below to fix my error, but none of it worked.
How to convert RDD of dense vector into DataFrame in pyspark?
I am trying to convert a dense vector into a DataFrame (preferably a Spark DataFrame) along with column names, and I am running into issues.
The column in my Spark DataFrame is a vector that was created using VectorAssembler, and I now want to convert it back to a DataFrame, as I would like to create plots of some of the variables in the vector.
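For context, the vector column was built roughly like this (the input column names below are placeholders; my real DataFrame has 200 numeric columns):

from pyspark.ml.feature import VectorAssembler

# Placeholder input column names; the real assembler uses 200 numeric columns.
assembler = VectorAssembler(inputCols=["feature_1", "feature_2", "feature_3"],
                            outputCol="all_features")
output = assembler.transform(df)  # df is the original Spark DataFrame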
Approach 1:
from pyspark.ml.linalg import SparseVector, DenseVector
from pyspark.ml.linalg import Vectors

temp = output.select("all_features")
temp.rdd.map(
    lambda row: (DenseVector(row[0].toArray()))
).toDF()

Below is the error:
TypeError: not supported type: <type 'numpy.ndarray'>

Approach 2:
from pyspark.ml.linalg import VectorUDT
from pyspark.sql.functions import udf
from pyspark.ml.linalg import *

as_ml = udf(lambda v: v.asML() if v is not None else None, VectorUDT())
result = output.withColumn("all_features", as_ml("all_features"))
result.head(5)

Error:
AttributeError: 'numpy.ndarray' object has no attribute 'asML'

I also tried converting the DataFrame into a pandas DataFrame, but I am not able to split the values into separate columns afterwards.
Approach 3:
import pandas as pd

pandas_df = temp.toPandas()
pandas_df1 = pd.DataFrame(pandas_df.all_features.values.tolist())

The code above runs fine, but I still end up with only one column in my DataFrame, with all the values of each row kept together as a list.
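To make the goal concrete, this is the kind of split I am after on the pandas side (just a sketch of the intent, not something I have verified; it assumes each entry in all_features exposes toArray(), and split_df is just a name I made up):

import pandas as pd

pandas_df = temp.toPandas()
# Turn each vector into a plain Python list first, so pandas can spread
# the 200 values across separate columns instead of one list-valued column.
split_df = pd.DataFrame(pandas_df["all_features"].apply(lambda v: v.toArray().tolist()).tolist())
split_df.columns = ["col" + str(i + 1) for i in range(split_df.shape[1])]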
Any help is greatly appreciated!
EDIT:
Here is what my temp DataFrame looks like. It has just one column, all_features. I am trying to create a DataFrame that splits all of these values into separate columns (all_features is a vector that was created from 200 columns).
+--------------------+
|        all_features|
+--------------------+
|[0.01193689934723...|
|[0.04774759738895...|
|[0.0,0.0,0.194417...|
|[0.02387379869447...|
|[1.89796699621085...|
+--------------------+
only showing top 5 rows

The expected output is a DataFrame with all 200 columns separated out:
+----------------------------+
| col1| col2| col3|...
+----------------------------+
|0.01193689934723|0.0|0.5049431301173817...
|0.04774759738895|0.0|0.1657316216149636...
|0.0|0.0|7.213126372469...
|0.02387379869447|0.0|0.1866693496827619|...
|1.89796699621085|0.0|0.3192169213385746|...
+----------------------------+
only showing top 5 rows

Here is what my pandas DataFrame output looks like:
                                                   0
0  [0.011936899347238104, 0.0, 0.5049431301173817...
1  [0.047747597388952415, 0.0, 0.1657316216149636...
2  [0.0, 0.0, 0.19441761495525278, 7.213126372469...
3  [0.023873798694476207, 0.0, 0.1866693496827619...
4  [1.8979669962108585, 0.0, 0.3192169213385746, ...
Comment: Have you tried .values.tolist() or rdd.map(lambda x: (x, )).toDF() as given in the link you specified? That usually works.
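If I follow that comment correctly, the idea is to turn each row into a plain tuple before calling toDF(). Applied to my case (one column per vector element rather than a single wrapped value), it would look something like the sketch below; I have not verified it, and split_sdf and col_names are just names I made up:

# Pull the vector out of each Row, convert it to a plain tuple of floats,
# and let toDF() create one column per element.
col_names = ["col" + str(i + 1) for i in range(200)]
split_sdf = temp.rdd.map(
    lambda row: tuple(float(x) for x in row["all_features"].toArray())
).toDF(col_names)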