I currently have a dataset of transaction histories of users in the following format:

+---------+------------+------------+
| user_id | order_date | product_id |
+---------+------------+------------+
|       1 |   20190101 |        123 |
|       1 |   20190102 |        331 |
|       1 |   20190301 |       1029 |
+---------+------------+------------+

I'm trying to transform the dataset to be used for an Item2Vec model -- which I believe has to look like this:

+---------+------------------+
| user_id | seq_vec          |
+---------+------------------+
|       1 | [123, 331, 1029] |
+---------+------------------+

I'm assuming the dataset has to be formatted this way from looking at examples of Word2Vec (https://spark.apache.org/docs/2.2.0/ml-features.html#word2vec).

Is there a built-in PySpark method for collecting the values of the product_id column into a list when grouping by user_id?

1 Answer

collect_list does the trick:

import pyspark.sql.functions as F

rawData = [(1, 20190101, 123),
           (1, 20190102, 331),
           (1, 20190301, 1029)]
df = spark.createDataFrame(rawData).toDF("user_id", "order_date", "product_id")
df.groupBy("user_id").agg(F.collect_list("product_id").alias("vec")).show()

+-------+----------------+
|user_id|             vec|
+-------+----------------+
|      1|[123, 331, 1029]|
+-------+----------------+
1 Comment

This works if the vector order does not matter. It gets much more complicated if it does. See: stackoverflow.com/questions/39505599/…
