
Hi, I'm trying to build a recommendation system with Spark.

I have a data frame with user emails and movie ratings.

    df = pd.DataFrame(
        np.array([["[email protected]", 2, 3], ["[email protected]", 5, 5], ["[email protected]", 8, 2], ["[email protected]", 9, 3]]),
        columns=['user', 'movie', 'rating'],
    )
    sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

    user              movie  rating
    [email protected]  2      3
    [email protected]  5      5
    [email protected]  8      2
    [email protected]  9      3

My first question: PySpark MLlib doesn't accept emails as user identifiers, am I correct? Because of this, I need to replace each email with a primary key.

My approach was to create a temporary table and select the distinct users. Now I want to add a new column with a row number, and this number will be the primary key for each user.

    sparkdf.registerTempTable("sparkdf")
    DistinctUsers = sqlContext.sql("Select distinct user FROM sparkdf")

What I have

    +------------+
    |        user|
    +------------+
    |[email protected]|
    |[email protected]|
    |[email protected]|
    +------------+

What I want

    +------------+
    |        user| PK
    +------------+
    |[email protected]| 1
    |[email protected]| 2
    |[email protected]| 3
    +------------+

Next I will do a join to obtain the final data frame to use in MLlib:

    user  movie  rating
    1     2      3
    1     5      5
    2     8      2
    3     9      3
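To pin down the expected result, the pipeline described above (distinct users, a row number per user, then a join back onto the ratings) can be sketched in plain Python. This is only an illustration of the logic, not Spark code: the function name `assign_keys` and the email addresses are hypothetical placeholders (the real addresses are redacted in the post), and numbering the users in sorted order is an assumption.

```python
def assign_keys(rows):
    """Replace the user string in each (user, movie, rating) row with an integer key."""
    # Step 1: distinct users, like "SELECT DISTINCT user".
    distinct_users = sorted({user for user, _, _ in rows})
    # Step 2: number the users from 1; this plays the role of the PK column.
    user_to_pk = {user: pk for pk, user in enumerate(distinct_users, start=1)}
    # Step 3: "join" the key back onto the ratings and drop the email.
    return [(user_to_pk[user], movie, rating) for user, movie, rating in rows]


# Hypothetical sample data with the same shape as the question's data frame.
ratings = [
    ("a@example.com", 2, 3),
    ("a@example.com", 5, 5),
    ("b@example.com", 8, 2),
    ("c@example.com", 9, 3),
]

final_rows = assign_keys(ratings)
for row in final_rows:
    print(row)
```

With this sample input, the first user gets key 1 on both of their rows, and the other two users get keys 2 and 3, which matches the desired table.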

Regards and thanks for your time.

1 Answer

The question "Primary keys with Apache Spark" practically answers yours, but in this particular case StringIndexer could be a better choice:

    from pyspark.ml.feature import StringIndexer

    indexer = StringIndexer(inputCol="user", outputCol="user_id")
    indexed = indexer.fit(sparkdf).transform(sparkdf)
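To give a feel for what the `user_id` column would contain, here is a rough plain-Python imitation of how StringIndexer assigns indices. It assumes the default ordering (most frequent label first) and alphabetical tie-breaking as documented for recent Spark versions; the helper name `string_index` and the email addresses are hypothetical, and this is not the library's implementation.

```python
from collections import Counter


def string_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties alphabetically (assumed behaviour).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # StringIndexer emits doubles starting at 0.0, so floats are used here too.
    return {label: float(index) for index, label in enumerate(ordered)}


# Hypothetical users, one entry per rating row.
users = ["a@example.com", "a@example.com", "b@example.com", "c@example.com"]

mapping = string_index(users)
user_ids = [mapping[user] for user in users]
print(mapping)
print(user_ids)
```

Note two differences from the hand-made primary key in the question: the indices start at 0 rather than 1, and they are floating-point values, so a cast to an integer type may be wanted before handing the column to a recommender.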