






It's happening because you are not applying partitionBy on a key-value pair RDD. Your RDD must be a key-value pair, and your key type should be an integer. I don't have sample data for your Hive table, so let's demonstrate the point using the Hive table below.

I have created the dataframe below from the Hive table:

df = spark.table("udb.emp_details_table")
df.show()

+------+--------+--------+----------------+
|emp_id|emp_name|emp_dept|emp_joining_date|
+------+--------+--------+----------------+
|     1|     AAA|      HR|      2018-12-06|
|     1|     BBB|      HR|      2017-10-26|
|     2|     XXX|   ADMIN|      2018-10-22|
|     2|     YYY|   ADMIN|      2015-10-19|
|     2|     ZZZ|      IT|      2018-05-14|
|     3|     GGG|      HR|      2018-06-30|
+------+--------+--------+----------------+

Now I wish to partition my dataframe and keep similar keys in one partition, so I have converted my dataframe to an RDD, as you can only apply partitionBy on an RDD for re-partitioning.

myrdd = df.rdd
newrdd = myrdd.partitionBy(10, lambda k: int(k[0]))
newrdd.take(10)

I got the same error:

File "/usr/hdp/current/spark2-client/python/pyspark/rdd.py", line 1767, in add_shuffle_key
    for k, v in iterator:
ValueError: too many values to unpack
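The traceback points at the real cause: inside partitionBy, Spark walks the RDD with `for k, v in iterator`, so every element must be a 2-tuple. A full row has more fields than that, which a pure-Python sketch (no Spark needed) reproduces:

```python
# A row from the table, as a plain tuple; Spark's shuffle code
# tries to unpack every element as (k, v), which fails here.
row = ("1", "AAA", "HR", "2018-12-06")

try:
    k, v = row  # mirrors: for k, v in iterator
except ValueError as err:
    print(err)  # "too many values to unpack ..."
```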

Hence, we need to convert our RDD into key-value pairs to use partitionBy:

keypair_rdd = myrdd.map(lambda x: (x[0], x[1:]))
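This pair-building step is easy to check without Spark; the lambda just splits each row tuple into its first field (the key) and the remaining fields (the value):

```python
# Plain tuples standing in for the rows of df.rdd
rows = [
    ("1", "AAA", "HR", "2018-12-06"),
    ("2", "XXX", "ADMIN", "2018-10-22"),
]

# Same transformation as myrdd.map(lambda x: (x[0], x[1:]))
pairs = [(x[0], x[1:]) for x in rows]
print(pairs[0])  # ('1', ('AAA', 'HR', '2018-12-06'))
```

Every element is now a 2-tuple, which is exactly the shape partitionBy expects.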

Now you can see that the RDD has been converted to key-value pairs, and you can therefore distribute your data across partitions according to the available keys.

[(u'1', (u'AAA', u'HR', datetime.date(2018, 12, 6))),
 (u'1', (u'BBB', u'HR', datetime.date(2017, 10, 26))),
 (u'2', (u'XXX', u'ADMIN', datetime.date(2018, 10, 22))),
 (u'2', (u'YYY', u'ADMIN', datetime.date(2015, 10, 19))),
 (u'2', (u'ZZZ', u'IT', datetime.date(2018, 5, 14))),
 (u'3', (u'GGG', u'HR', datetime.date(2018, 6, 30)))]
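Counting records per key shows the distribution partitionBy will produce, since every record with the same key ends up in the same partition:

```python
from collections import Counter

# Keys of the six sample records above
keys = ["1", "1", "2", "2", "2", "3"]

counts = Counter(keys)
print(sorted(counts.items()))  # [('1', 2), ('2', 3), ('3', 1)]
```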

Using partitionBy on the key-value RDD now:

newrdd = keypair_rdd.partitionBy(5, lambda k: int(k[0]))

Let's take a look at the partitions. Data is grouped and similar keys are now stored in the same partitions; two of them are empty.

>>> print("Partitions structure: {}".format(newrdd.glom().map(len).collect()))
Partitions structure: [0, 2, 3, 1, 0]
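You can reproduce that layout with plain Python: partitionBy applies the partition function to each key and takes the result modulo the number of partitions, so int(key) sends keys 1, 2 and 3 to partitions 1, 2 and 3 and leaves partitions 0 and 4 empty. A small simulation using the same six keys:

```python
keys = ["1", "1", "2", "2", "2", "3"]
num_partitions = 5

# partitionBy picks the bucket as: partition_func(key) % num_partitions
partitions = [[] for _ in range(num_partitions)]
for k in keys:
    partitions[int(k[0]) % num_partitions].append(k)

print([len(p) for p in partitions])  # [0, 2, 3, 1, 0]
```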

Now let's say I want to partition my data with a custom scheme, so I have created the function below to keep keys '1' and '3' in the same partition:

def partitionFunc(key):
    import random
    if key == 1 or key == 3:
        return 0
    else:
        return random.randint(1, 2)

newrdd = keypair_rdd.partitionBy(5, lambda k: partitionFunc(int(k[0])))

>>> print("Partitions structure: {}".format(newrdd.glom().map(len).collect()))
Partitions structure: [3, 3, 0, 0, 0]

As you can see, keys 1 and 3 are now stored in one partition and the rest in another.

I hope this helps. You can try partitionBy on your dataframe; just make sure to convert it into key-value pairs and keep the key as type integer.