I have a raw data DataFrame like this:
+-----------+--------------------+------+
|device     | timestamp          | value|
+-----------+--------------------+------+
|   device_A|2022-01-01 18:00:01 |   100|
|   device_A|2022-01-01 18:00:02 |    99|
|   device_A|2022-01-01 18:00:03 |   100|
|   device_A|2022-01-01 18:00:04 |   102|
|   device_A|2022-01-01 18:00:05 |   100|
|   device_A|2022-01-01 18:00:06 |    99|
|   device_A|2022-01-01 18:00:11 |    98|
|   device_A|2022-01-01 18:00:12 |   100|
|   device_A|2022-01-01 18:00:13 |   100|
|   device_A|2022-01-01 18:00:15 |   101|
|   device_A|2022-01-01 18:00:17 |   101|

I'd like to aggregate it into 10-second windows, listing the distinct values and how often each one occurs, like this:
+-----------+--------------------+------------+-------+
|device     | windowtime         |      values| counts|
+-----------+--------------------+------------+-------+
|   device_A|2022-01-01 18:00:00 |[99,100,102]|[1,3,1]|
|   device_A|2022-01-01 18:00:10 |[98,100,101]|[1,2,2]|

The goal is to plot a heat-map of the values later.
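For reference, the sample above can be reproduced with something like this (spark is my SparkSession, and df is the DataFrame I refer to below):

import spark.implicits._
import org.apache.spark.sql.functions._

// build the raw sample data shown above; the timestamp must be a real
// TimestampType column for time-based windowing to work later
val df = Seq(
  ("device_A", "2022-01-01 18:00:01", 100),
  ("device_A", "2022-01-01 18:00:02", 99),
  ("device_A", "2022-01-01 18:00:03", 100),
  ("device_A", "2022-01-01 18:00:04", 102),
  ("device_A", "2022-01-01 18:00:05", 100),
  ("device_A", "2022-01-01 18:00:06", 99),
  ("device_A", "2022-01-01 18:00:11", 98),
  ("device_A", "2022-01-01 18:00:12", 100),
  ("device_A", "2022-01-01 18:00:13", 100),
  ("device_A", "2022-01-01 18:00:15", 101),
  ("device_A", "2022-01-01 18:00:17", 101)
).toDF("device", "timestamp", "value")
  .withColumn("timestamp", to_timestamp($"timestamp"))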
I have succeeded in getting the values column, but it is not clear to me how to calculate the corresponding counts. This is what I have so far:
.withColumn("values", collect_list(col("value")).over(Window.partitionBy($"device").orderBy($"timestamp".desc)))

How can I do this kind of weighted list aggregation in Apache Spark?
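One idea I had was to drop the window function entirely and instead count the distinct values per device and 10-second tumbling window with two groupBy steps, roughly as sketched below (using the df defined above), but I'm not sure this is correct or efficient, or whether the values and counts arrays are guaranteed to stay aligned:

import org.apache.spark.sql.functions._

val result = df
  // count how often each distinct value occurs per device and 10-second window
  .groupBy($"device", window($"timestamp", "10 seconds"), $"value")
  .count()
  // collect (value, count) pairs per window, sorted by value
  .groupBy($"device", $"window.start".alias("windowtime"))
  .agg(sort_array(collect_list(struct($"value", $"count"))).alias("pairs"))
  // split the sorted pairs into two parallel arrays
  .select(
    $"device",
    $"windowtime",
    $"pairs.value".alias("values"),
    $"pairs.count".alias("counts"))

Is something like this reasonable, or is there a more idiomatic way to get the two aligned arrays?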