I'm trying to merge three RDDs based on the same key. The following is the data:
```
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
|     2|    Panda|   15|
|     3|    Candy|   15|
|     1|  Bahroze|   15|
+------+---------+-----+

+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
|     2|    Panda| 7342|
|     3|    Candy| 5669|
|     1|  Bahroze| 8361|
+------+---------+-----+

+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
|     2|    Panda|   37|
|     3|    Candy|   27|
|     1|  Bahroze|   39|
+------+---------+-----+
```

I'm able to merge these three DFs. I converted each of them to an RDD of dicts with the following code:
```python
new_rdd = userTotalVisits.rdd.map(lambda row: row.asDict(True))
```

After the conversion, I keep one RDD and collect the other two into lists. I then map over the first RDD and add keys to each row where the UserID matches. I was hoping there was a better way of doing this in PySpark. Here's the code I've written:
```python
def transform(row):
    # Add a nested "Total" key to each row
    for x in conversion_list:  # second DF, collected into a list of dicts
        if x['UserID'] == row['UserID']:
            row["Total"] = {
                "Visitors": row["Total"],
                "Conversions": x["Total"]
            }
    for y in Revenue_list:  # third DF, collected into a list of dicts
        if y['UserID'] == row['UserID']:
            row["Total"]["Revenue"] = y["Total"]
    return row

potato = new_rdd.map(lambda row: transform(row))  # first RDD
```

How can I merge these three RDDs/DFs efficiently? (I had to perform three different tasks on a huge DF.) I'm looking for a better, more efficient approach. PS: I'm still a Spark newbie. My code produces the following result, which is what I need:
```
{'UserID': '2', 'UserLabel': 'Panda', 'Total': {'Visitors': 37, 'Conversions': 15, 'Revenue': 7342}}
{'UserID': '3', 'UserLabel': 'Candy', 'Total': {'Visitors': 27, 'Conversions': 15, 'Revenue': 5669}}
{'UserID': '1', 'UserLabel': 'Bahroze', 'Total': {'Visitors': 39, 'Conversions': 15, 'Revenue': 8361}}
```

Thank you.
