How can I make this code more efficient in Spark?
I need to calculate the minimum, maximum, count, and mean of my data.
Here is my sample data:

    Name  Shop     Money
    A     Shop001  99.99
    A     Shop001  87.15
    B     Shop001   3.99
    ...
Now I want to group the data by the composite key Name + Shop and compute the mean, min, max, and count for each key. Then I retrieve the results with collect().
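For the sample above, the expected result per key would be:

    A_Shop001  count=2  min=87.15  max=99.99  mean=93.57
    B_Shop001  count=1  min=3.99   max=3.99   mean=3.99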
Here is my Spark code:

    from operator import add

    def tupleDivide(y):
        # y is a (total, count) pair produced by the join below
        return float(y[0]) / y[1]

    def smin(a, b):
        return min(a, b)

    def smax(a, b):
        return max(a, b)

    raw = sgRDD.map(lambda x: getVar(parserLine(x), list_C + list_N)).cache()
    cnt = raw.map(lambda (x, y, z): (x + "_" + y, 1)).countByKey()
    # renamed from sum/min/max so the built-ins are not shadowed
    sums = raw.map(lambda (x, y, z): (x + "_" + y, z)).reduceByKey(add)
    mins = raw.map(lambda (x, y, z): (x + "_" + y, z)).reduceByKey(smin)
    maxs = raw.map(lambda (x, y, z): (x + "_" + y, z)).reduceByKey(smax)
    raw_cntRDD = sc.parallelize(cnt.items(), 3)
    raw_mean = sums.join(raw_cntRDD).map(lambda (x, y): (x, tupleDivide(y)))

Could anyone suggest a more elegant way to write this?
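For reference, a minimal sketch of how all four statistics could be computed in a single pass with aggregateByKey (using the sample rows above in place of my real sgRDD pipeline):

    from pyspark import SparkContext

    sc = SparkContext("local", "stats")

    # Sample rows from above: (name, shop, money)
    raw = sc.parallelize([("A", "Shop001", 99.99),
                          ("A", "Shop001", 87.15),
                          ("B", "Shop001", 3.99)])

    pairs = raw.map(lambda t: (t[0] + "_" + t[1], t[2]))

    # Accumulator layout: (count, total, minimum, maximum)
    zero = (0, 0.0, float("inf"), float("-inf"))

    def seq_op(acc, v):
        # Fold one money value into a partition-local accumulator.
        return (acc[0] + 1, acc[1] + v, min(acc[2], v), max(acc[3], v))

    def comb_op(a, b):
        # Merge partial accumulators from different partitions.
        return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))

    stats = pairs.aggregateByKey(zero, seq_op, comb_op)
    result = stats.mapValues(
        lambda s: (s[0], s[2], s[3], s[1] / s[0])  # (count, min, max, mean)
    ).collect()

This walks the data once instead of four times and avoids the extra parallelize/join round trip that the mean calculation needs above.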
Thanks!