I was playing with the Moby Word's list using Apache Spark, here is the file. I first created an RDD using this textfile
lines = sc.textFile("words.txt") And then created two RDDs containing words which have "p" and "s" in them
plines = lines.filter(lambda x: "p" in x) slines = lines.filter(lambda x: "s" in x) and then created a union of these two
union_list = slines.union(plines) I then counted the number of words in each list with the "count" method and that came out as 64803, 22969 and 87772 for slines, plines and union_list respectively. Also 64803+22969=87772, which means there are no words with both "p" and "s". I created a new RDD containing words with "p" and "s" using
pslines = lines.filter(lambda x: ("p" in x) and ("s" in x)) and counted the elements which gave 13616, and then created a new RDD containing words with "p" or "s"
newlist = lines.filter(lambda x: ("p" in x) or ("s" in x)) and counted the elements which gave 74156, which makes sense cause 64803+22969-13616=74156. What did I do wrong with the union method? I am using Spark 1.6 on Windows 10 and Python 3.5.1.