0

How to make this code more efficient in Spark?
I need to calculate minimum, maximum, count, mean from data.
Here is my sample data,

Name Shop Money
A Shop001 99.99
A Shop001 87.15
B Shop001 3.99
...

Now I try to organize my data to generate mean, min, max, count by Name+Shop (key).
Then get the result by collect().
Here is my code in spark,

 

def tupleDivide(y): return float(y[0])/y[1] def smin(a, b): return min(a, b) def smax(a, b): return max(a, b) raw = sgRDD.map(lambda x: getVar(parserLine(x),list_C+list_N)).cache() cnt = raw.map(lambda (x,y,z): (x+"_"+y, 1)).countByKey() sum = raw.map(lambda (x,y,z): (x+"_"+y, z)).reduceByKey(add) min = raw.map(lambda (x,y,z): (x+"_"+y, z)).reduceByKey(smin) max = raw.map(lambda (x,y,z): (x+"_"+y, z)).reduceByKey(smax) raw_cntRDD = sc.parallelize(cnt.items(),3) raw_mean = sum.join(raw_cntRDD).map(lambda (x, y): (x, tupleDivide(y)))

Would anyone provide some suggestion about the elegant coding style?
Thanks!

1 Answer 1

2

You should use aggregateByKey for more optimal processing. The idea is that you store state vector which consists of count, min, max, and sum, and use aggregation functions to get the final values. Also, you can use tuple as a key, it is not necessary to concatenate keys into a single string.

data = [ ['x', 'shop1', 1], ['x', 'shop1', 2], ['x', 'shop2', 3], ['x', 'shop2', 4], ['x', 'shop3', 5], ['y', 'shop4', 6], ['y', 'shop4', 7], ['y', 'shop4', 8] ] def add(state, x): state[0] += 1 state[1] = min(state[1], x) state[2] = max(state[2], x) state[3] += x return state def merge(state1, state2): state1[0] += state2[0] state1[1] = min(state1[1], state2[1]) state1[2] = max(state1[2], state2[2]) state1[3] += state2[3] return state1 res = sc.parallelize(data).map(lambda x: ((x[0], x[1]), x[2])).aggregateByKey([0, 10000, 0, 0], add, merge) for x in res.collect(): print 'Client "%s" shop "%s" : count %d min %f max %f avg %f' % ( x[0][0], x[0][1], x[1][0], x[1][1], x[1][2], float(x[1][3])/float(x[1][0]) ) 
Sign up to request clarification or add additional context in comments.

1 Comment

I got it! Thanks a lot!

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.