I need to apply an aggregation function to a stream of data with Apache Spark Streaming (NOT Spark Streaming SQL).

In my case I have a Kafka producer that sends messages in JSON format. The schema is {'a': String, 'b': String, 'c': Integer, 'd': Double}.
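For concreteness, a single message might look like this (the values here are made up for illustration):

    {"a": "sensor-1", "b": "room-a", "c": 42, "d": 3.14}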

I need to group by attributes 'a' and 'b' every 5 seconds and apply an aggregation function (e.g. average, sum, min, or max) to the other two attributes.

How can I do that?

Thanks

  • Have you already tried the reduce function? spark.apache.org/docs/latest/… Commented Jun 15, 2017 at 13:25
  • the problem is that the reduce function takes two parameters and returns one, and I need to keep the same schema. In other words, if my initial schema is {'a': String, 'b': String, 'c': Integer, 'd': Double}, the resulting schema (with an AVG aggregate function) should be {'GROUPBYa': String, 'GROUPBYb': String, 'AVGc': Integer, 'AVGd': Double} (see the sketch after these comments). Commented Jun 15, 2017 at 13:43
  • you could also use transform or foreachRDD and apply any arbitrary RDD function, or convert to DataFrames and use the DataFrame aggregation API Commented Jun 15, 2017 at 13:47
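For the DStream-only route suggested in the first comment, one common pattern is to key every record by (a, b) and reduce value tuples that carry running sums plus a count, so reduce's two-in-one-out shape is no obstacle. This is only a sketch: it assumes a 5-second batch interval, and parseRecord and the Kafka stream setup are hypothetical placeholders, not code from the question.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DStreamAggregation {
      // Hypothetical parser: extracts the four fields from one JSON message.
      // A real implementation would use a JSON library such as json4s or Jackson.
      def parseRecord(json: String): (String, String, Int, Double) = ???

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("dstream-aggregation")
        // A 5-second batch interval means each reduceByKey below covers
        // exactly one 5-second window of data.
        val ssc = new StreamingContext(conf, Seconds(5))

        // Raw JSON messages from Kafka; the connector setup is omitted
        // because the question does not show it.
        val jsonStream: DStream[String] = ???

        val averaged = jsonStream
          .map(parseRecord)                                        // (a, b, c, d)
          .map { case (a, b, c, d) => ((a, b), (c.toLong, d, 1L)) } // key by (a, b)
          // reduce stays two-in-one-out by carrying sums plus a count...
          .reduceByKey { (x, y) =>
            (x._1 + y._1, x._2 + y._2, x._3 + y._3)
          }
          // ...and the averages are recovered afterwards.
          .mapValues { case (cSum, dSum, n) => (cSum.toDouble / n, dSum / n) }

        averaged.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

The same structure works for SUM (keep the sums, drop the count), MIN, or MAX (replace the additions with math.min or math.max).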

1 Answer


To get you started, you could approach aggregation like this:

    import org.apache.spark.sql.functions.avg
    import sparkSession.implicits._

    jsonDstream.foreachRDD { jsonRDD =>
      val df = sparkSession.read.json(jsonRDD)
      val aggr = df.groupBy($"a", $"b").agg(avg($"c"))
      // ... do something with aggr ...
    }
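If both numeric columns need aggregating (and renaming, as the comment above describes), agg accepts several expressions at once; avg, sum, min, and max all live in org.apache.spark.sql.functions. A possible extension of the answer's snippet, with illustrative column aliases:

    import org.apache.spark.sql.functions.avg  // sum, min, max work the same way
    import sparkSession.implicits._

    jsonDstream.foreachRDD { jsonRDD =>
      val df = sparkSession.read.json(jsonRDD)
      // Several aggregates in one pass; the aliases mirror the schema
      // {'GROUPBYa', 'GROUPBYb', 'AVGc', 'AVGd'} sketched in the comments.
      val aggr = df.groupBy($"a", $"b")
        .agg(avg($"c").as("AVGc"), avg($"d").as("AVGd"))
      aggr.show()
    }

The 5-second cadence comes from the batch interval the StreamingContext was created with, which the answer does not show.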