I am learning Spark, and when I tested the repartition() function in the PySpark shell with the expressions below, I observed a very strange result: all elements fall into the same partition after repartition(). (Here I used glom() to inspect the partitioning within the RDD.) I was expecting repartition() to shuffle the elements and distribute them randomly among partitions. This only happens when I repartition with the new number of partitions <= the original number of partitions.
During my test, if I set the new number of partitions > the original number of partitions, no shuffling is observed either: the elements stay in their original groups. Am I doing anything wrong here?
In [1]: sc.parallelize(range(20), 8).glom().collect()
Out[1]: [[0, 1], [2, 3], [4, 5], [6, 7, 8, 9], [10, 11], [12, 13], [14, 15], [16, 17, 18, 19]]

In [2]: sc.parallelize(range(20), 8).repartition(8).glom().collect()
Out[2]: [[], [], [], [], [], [], [2, 3, 6, 7, 8, 9, 14, 15, 16, 17, 18, 19, 0, 1, 12, 13, 4, 5, 10, 11], []]

In [3]: sc.parallelize(range(20), 8).repartition(10).glom().collect()
Out[3]: [[], [0, 1], [14, 15], [10, 11], [], [6, 7, 8, 9], [2, 3], [16, 17, 18, 19], [12, 13], [4, 5]]

I am using Spark version 2.1.1.
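To clarify what I was expecting: my mental model is that repartition() deals each input partition's elements out round-robin across the new partitions, starting from a random offset per input partition. Here is a plain-Python sketch of that mental model (this is my own illustration, not Spark's actual implementation, and the function name `expected_repartition` is just something I made up for this example):

```python
import random

def expected_repartition(partitions, num_new):
    # Sketch of what I expected repartition() to do: each input partition
    # picks a random starting offset, then deals its elements round-robin
    # across the new partitions. (My assumption, not Spark source code.)
    new_parts = [[] for _ in range(num_new)]
    for part in partitions:
        position = random.randrange(num_new)
        for item in part:
            new_parts[position % num_new].append(item)
            position += 1
    return new_parts

# The original 8 partitions from Out[1] above.
old = [[0, 1], [2, 3], [4, 5], [6, 7, 8, 9],
       [10, 11], [12, 13], [14, 15], [16, 17, 18, 19]]
print(expected_repartition(old, 8))
```

Under this model the 20 elements would end up spread across the 8 new partitions, never all concentrated into a single one as in Out[2].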