0

I was playing with the Moby Word's list using Apache Spark, here is the file. I first created an RDD using this textfile

 lines = sc.textFile("words.txt") 

And then created two RDDs containing words which have "p" and "s" in them

 plines = lines.filter(lambda x: "p" in x) slines = lines.filter(lambda x: "s" in x) 

and then created a union of these two

 union_list = slines.union(plines) 

I then counted the number of words in each list with the "count" method and that came out as 64803, 22969 and 87772 for slines, plines and union_list respectively. Also 64803+22969=87772, which means there are no words with both "p" and "s". I created a new RDD containing words with "p" and "s" using

 pslines = lines.filter(lambda x: ("p" in x) and ("s" in x)) 

and counted the elements which gave 13616, and then created a new RDD containing words with "p" or "s"

 newlist = lines.filter(lambda x: ("p" in x) or ("s" in x)) 

and counted the elements which gave 74156, which makes sense cause 64803+22969-13616=74156. What did I do wrong with the union method? I am using Spark 1.6 on Windows 10 and Python 3.5.1.

1 Answer 1

2

union() method is not a set union operation. It just concatenats two RDD's, so the intersection will be counted twice. If you want the true set union, you need to run distinct() on your resulting RDD:

union_list = slines.union(plines).distinct()

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.