Apache Spark union method giving inexplicable result

Question

I was playing with the Moby Word's list using Apache Spark, here is the file. I first created an RDD using this textfile

 lines = sc.textFile("words.txt")

And then created two RDDs containing words which have "p" and "s" in them

 plines = lines.filter(lambda x: "p" in x) slines = lines.filter(lambda x: "s" in x)

and then created a union of these two

 union_list = slines.union(plines)

I then counted the number of words in each list with the "count" method and that came out as 64803, 22969 and 87772 for slines, plines and union_list respectively. Also 64803+22969=87772, which means there are no words with both "p" and "s". I created a new RDD containing words with "p" and "s" using

 pslines = lines.filter(lambda x: ("p" in x) and ("s" in x))

and counted the elements which gave 13616, and then created a new RDD containing words with "p" or "s"

 newlist = lines.filter(lambda x: ("p" in x) or ("s" in x))

and counted the elements which gave 74156, which makes sense cause 64803+22969-13616=74156. What did I do wrong with the union method? I am using Spark 1.6 on Windows 10 and Python 3.5.1.

pavel · Accepted Answer · 2016-01-24 07:21:17Z

union() method is not a set union operation. It just concatenats two RDD's, so the intersection will be counted twice. If you want the true set union, you need to run distinct() on your resulting RDD:

union_list = slines.union(plines).distinct()

Collectives™ on Stack Overflow

Apache Spark union method giving inexplicable result

1 Answer 1

Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Related