PySpark: add a new column with the data frame row number

Question

Hi, I'm trying to build a recommendation system with Spark.

I have a data frame with user emails and movie ratings.

    import numpy as np
    import pandas as pd

    # sqlContext is assumed to be the one provided by the PySpark shell
    df = pd.DataFrame(
        np.array([["[email protected]",2,3],["[email protected]",5,5],["[email protected]",8,2],["[email protected]",9,3]]),
        columns=['user','movie','rating'])
    sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

    user              movie rating
    [email protected] 2     3
    [email protected] 5     5
    [email protected] 8     2
    [email protected] 9     3

My first question: PySpark MLlib doesn't accept emails as user identifiers, is that correct? Because of this, I need to replace each email with a primary key.

My approach was to create a temporary table and select the distinct users. Now I want to add a new column with a row number, and this number will be the primary key for each user.

    sparkdf.registerTempTable("sparkdf")
    DistinctUsers = sqlContext.sql("Select distinct user FROM sparkdf")

What I have

    +-----------------+
    |             user|
    +-----------------+
    |[email protected]|
    |[email protected]|
    |[email protected]|
    +-----------------+

What I want

    +-----------------+---+
    |             user| PK|
    +-----------------+---+
    |[email protected]|  1|
    |[email protected]|  2|
    |[email protected]|  3|
    +-----------------+---+

Next I will do a join and obtain my final data frame to use in MLlib

    user movie rating
    1    2     3
    1    5     5
    2    8     2
    3    9     3

Regards and thanks for your time.

Community · Accepted Answer · 2017-05-23 12:23:53Z

The question "Primary keys with Apache Spark" practically answers yours, but in this particular case using StringIndexer could be a better choice:

    from pyspark.ml.feature import StringIndexer

    indexer = StringIndexer(inputCol="user", outputCol="user_id")
    indexed = indexer.fit(sparkdf).transform(sparkdf)
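To get a feel for what the indexer produces without starting a Spark session, here is a rough plain-Python sketch of frequency-based label indexing. It is only an illustration, not Spark's implementation: the helper name `index_labels` and the example addresses are made up (the original emails are not visible in the post), and the alphabetical tie-break is an assumption added to keep the result deterministic. As far as I know, StringIndexer by default gives the most frequent label index 0 and stores the index as a floating-point column, so you may need to cast it to an integer, and add 1 if you really want keys starting at 1 as in the question.

```python
from collections import Counter


def index_labels(labels):
    """Map each distinct label to an integer id, most frequent label first."""
    counts = Counter(labels)
    # Descending frequency; ties broken alphabetically (an assumption, for determinism).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


# Hypothetical placeholder addresses standing in for the users in the question.
ratings = [
    ("alice@example.com", 2, 3),
    ("alice@example.com", 5, 5),
    ("bob@example.com", 8, 2),
    ("carol@example.com", 9, 3),
]

user_ids = index_labels(user for user, _, _ in ratings)

# Swap each email for its id; this has the same effect as the
# "distinct users + row number + join" plan described in the question.
indexed_ratings = [(user_ids[user], movie, rating) for user, movie, rating in ratings]
```

The point of the design is that the mapping is learned once (the `fit` step) and then applied to every row (the `transform` step), so no separate distinct-users table or join is needed.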
