I currently have a dataset of transaction histories of users in the following format:
+---------+------------+------------+
| user_id | order_date | product_id |
+---------+------------+------------+
| 1       | 20190101   | 123        |
| 1       | 20190102   | 331        |
| 1       | 20190301   | 1029       |
+---------+------------+------------+

I'm trying to transform the dataset to be used for an Item2Vec model, which I believe has to look like this:

+---------+-------------------+
| user_id | seq_vec           |
+---------+-------------------+
| 1       | [123, 331, 1029]  |
+---------+-------------------+

I'm assuming the dataset has to be formatted this way from looking at examples of Word2Vec (https://spark.apache.org/docs/2.2.0/ml-features.html#word2vec).
Is there a built-in PySpark method for creating a vector from the values in the product_id column when grouping by user_id?
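
To make the goal concrete, here is a plain-Python sketch of the transformation I'm after, not PySpark. The names `rows` and `build_sequences` are made up for illustration, and I'm assuming each user's products should be ordered by order_date. What I'm looking for is the PySpark equivalent of this.

```python
from collections import defaultdict

# Sample rows from the first table: (user_id, order_date, product_id).
rows = [
    (1, 20190101, 123),
    (1, 20190102, 331),
    (1, 20190301, 1029),
]


def build_sequences(rows):
    """Group product_ids by user_id, ordered by order_date (assumed ordering)."""
    seqs = defaultdict(list)
    # Sort by (user_id, order_date) so each user's products come out in purchase order.
    for user_id, _order_date, product_id in sorted(rows, key=lambda r: (r[0], r[1])):
        seqs[user_id].append(product_id)
    return dict(seqs)


print(build_sequences(rows))  # {1: [123, 331, 1029]}
```

So the result should be one row per user_id, with the product_id values collected into a single ordered sequence.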