How can I make this code more efficient in Spark?
I need to calculate the minimum, maximum, count, and mean of my data.
Here is my sample data:

    Name  Shop     Money
    A     Shop001  99.99
    A     Shop001  87.15
    B     Shop001   3.99
    ...
Now I want to group the data by the composite key Name + Shop and compute the mean, min, max, and count for each key. Then I retrieve the results with collect().
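For the sample above, the expected result per key would be:

    A_Shop001  count=2  min=87.15  max=99.99  mean=93.57
    B_Shop001  count=1  min=3.99   max=3.99   mean=3.99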
Here is my Spark code:

    from operator import add

    def tupleDivide(y):
        # y is a (total, count) pair produced by the join below
        return float(y[0]) / y[1]

    def smin(a, b):
        return min(a, b)

    def smax(a, b):
        return max(a, b)

    raw = sgRDD.map(lambda x: getVar(parserLine(x), list_C + list_N)).cache()
    cnt = raw.map(lambda (x, y, z): (x + "_" + y, 1)).countByKey()
    # renamed from sum/min/max so the built-ins are not shadowed
    sums = raw.map(lambda (x, y, z): (x + "_" + y, z)).reduceByKey(add)
    mins = raw.map(lambda (x, y, z): (x + "_" + y, z)).reduceByKey(smin)
    maxs = raw.map(lambda (x, y, z): (x + "_" + y, z)).reduceByKey(smax)
    raw_cntRDD = sc.parallelize(cnt.items(), 3)
    raw_mean = sums.join(raw_cntRDD).map(lambda (x, y): (x, tupleDivide(y)))

Could anyone suggest a more elegant way to write this?
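For reference, a minimal sketch of how all four statistics could be computed in a single pass with aggregateByKey (using the sample rows above in place of my real sgRDD pipeline):

    from pyspark import SparkContext

    sc = SparkContext("local", "stats")

    # Sample rows from above: (name, shop, money)
    raw = sc.parallelize([("A", "Shop001", 99.99),
                          ("A", "Shop001", 87.15),
                          ("B", "Shop001", 3.99)])

    pairs = raw.map(lambda t: (t[0] + "_" + t[1], t[2]))

    # Accumulator layout: (count, total, minimum, maximum)
    zero = (0, 0.0, float("inf"), float("-inf"))

    def seq_op(acc, v):
        # Fold one money value into a partition-local accumulator.
        return (acc[0] + 1, acc[1] + v, min(acc[2], v), max(acc[3], v))

    def comb_op(a, b):
        # Merge partial accumulators from different partitions.
        return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))

    stats = pairs.aggregateByKey(zero, seq_op, comb_op)
    result = stats.mapValues(
        lambda s: (s[0], s[2], s[3], s[1] / s[0])  # (count, min, max, mean)
    ).collect()

This walks the data once instead of four times and avoids the extra parallelize/join round trip that the mean calculation needs above.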
Thanks!