I need to apply an aggregation function to a stream of data with Apache Spark Streaming (NOT Spark Streaming SQL).

In my case I have a Kafka producer that sends messages in JSON format. The schema is {'a': String, 'b': String, 'c': Integer, 'd': Double}.
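For concreteness, a single message might look like this (the values here are made up for illustration):

    {"a": "sensor-1", "b": "room-a", "c": 42, "d": 3.14}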

I need to group by attributes 'a' and 'b' every 5 seconds and apply an aggregation function (e.g. average, sum, min, or max) to the other two attributes.

How can I do that?

Thanks

  • Have you already tried the reduce function? spark.apache.org/docs/latest/… Commented Jun 15, 2017 at 13:25
  • the problem is that the reduce function takes two parameters and returns one, and I need to keep the same schema. In other words, if my initial schema is {'a': String, 'b': String, 'c': Integer, 'd': Double}, the resulting schema (with an AVG aggregate function) should be {'GROUPBYa': String, 'GROUPBYb': String, 'AVGc': Integer, 'AVGd': Double} (see the sketch after these comments). Commented Jun 15, 2017 at 13:43
  • you could also use transform or foreachRDD and apply any arbitrary RDD function, or convert to DataFrames and use the DataFrame aggregation API Commented Jun 15, 2017 at 13:47
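For the DStream-only route suggested in the first comment, one common pattern is to key every record by (a, b) and reduce value tuples that carry running sums plus a count, so reduce's two-in-one-out shape is no obstacle. This is only a sketch: it assumes a 5-second batch interval, and parseRecord and the Kafka stream setup are hypothetical placeholders, not code from the question.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DStreamAggregation {
      // Hypothetical parser: extracts the four fields from one JSON message.
      // A real implementation would use a JSON library such as json4s or Jackson.
      def parseRecord(json: String): (String, String, Int, Double) = ???

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("dstream-aggregation")
        // A 5-second batch interval means each reduceByKey below covers
        // exactly one 5-second window of data.
        val ssc = new StreamingContext(conf, Seconds(5))

        // Raw JSON messages from Kafka; the connector setup is omitted
        // because the question does not show it.
        val jsonStream: DStream[String] = ???

        val averaged = jsonStream
          .map(parseRecord)                                        // (a, b, c, d)
          .map { case (a, b, c, d) => ((a, b), (c.toLong, d, 1L)) } // key by (a, b)
          // reduce stays two-in-one-out by carrying sums plus a count...
          .reduceByKey { (x, y) =>
            (x._1 + y._1, x._2 + y._2, x._3 + y._3)
          }
          // ...and the averages are recovered afterwards.
          .mapValues { case (cSum, dSum, n) => (cSum.toDouble / n, dSum / n) }

        averaged.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

The same structure works for SUM (keep the sums, drop the count), MIN, or MAX (replace the additions with math.min or math.max).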

1 Answer


To get you started, you could approach aggregation like this:

    import org.apache.spark.sql.functions.avg
    import sparkSession.implicits._

    jsonDstream.foreachRDD { jsonRDD =>
      val df = sparkSession.read.json(jsonRDD)
      val aggr = df.groupBy($"a", $"b").agg(avg($"c"))
      // ... do something with aggr ...
    }
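If both numeric columns need aggregating (and renaming, as the comment above describes), agg accepts several expressions at once; avg, sum, min, and max all live in org.apache.spark.sql.functions. A possible extension of the answer's snippet, with illustrative column aliases:

    import org.apache.spark.sql.functions.avg  // sum, min, max work the same way
    import sparkSession.implicits._

    jsonDstream.foreachRDD { jsonRDD =>
      val df = sparkSession.read.json(jsonRDD)
      // Several aggregates in one pass; the aliases mirror the schema
      // {'GROUPBYa', 'GROUPBYb', 'AVGc', 'AVGd'} sketched in the comments.
      val aggr = df.groupBy($"a", $"b")
        .agg(avg($"c").as("AVGc"), avg($"d").as("AVGd"))
      aggr.show()
    }

The 5-second cadence comes from the batch interval the StreamingContext was created with, which the answer does not show.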