I currently have a dataset of transaction histories of users in the following format:

+---------+------------+------------+
| user_id | order_date | product_id |
+---------+------------+------------+
|       1 |   20190101 |        123 |
|       1 |   20190102 |        331 |
|       1 |   20190301 |       1029 |
+---------+------------+------------+

I'm trying to transform the dataset to be used for an Item2Vec model -- which I believe has to look like this:

+---------+------------------+
| user_id | seq_vec          |
+---------+------------------+
|       1 | [123, 331, 1029] |
+---------+------------------+

I'm assuming the dataset has to be formatted this way from looking at examples of Word2Vec (https://spark.apache.org/docs/2.2.0/ml-features.html#word2vec).

Is there a built-in PySpark method for collecting the values of the product_id column into a list when grouping by user_id?

1 Answer

collect_list does the trick:

import pyspark.sql.functions as F

rawData = [(1, 20190101, 123),
           (1, 20190102, 331),
           (1, 20190301, 1029)]
df = spark.createDataFrame(rawData).toDF("user_id", "order_date", "product_id")
df.groupBy("user_id").agg(F.collect_list("product_id").alias("vec")).show()

+-------+----------------+
|user_id|             vec|
+-------+----------------+
|      1|[123, 331, 1029]|
+-------+----------------+
1 Comment

This works if the vector order does not matter. It gets much more complicated if it does. See: stackoverflow.com/questions/39505599/…
