I am trying to understand Spark Streaming in terms of aggregation principles. Streaming DataFrames are processed as mini-batches, and a computation is run on the mini-batch of data that arrived within a specific time window.
Let's say we have data coming in as:

Window_period_1[Data1, Data2, Data3] Window_period_2[Data4, Data5, Data6] ...

Then the first computation is done for Window_period_1 and the next for Window_period_2. If I need to use the new incoming data along with the historic data, say some kind of groupBy across Window_period_new and the data from Window_period_1 and Window_period_2, how would I do that?
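To show roughly what I mean, here is a minimal sketch of my current understanding, using the built-in rate source as a stand-in for my real feed and a key derived from value % 3 purely for illustration. As far as I can tell, a plain groupBy on a streaming DataFrame is a running aggregation whose state is kept across micro-batches, so the result should cover the earlier window periods plus the newest batch, but I'm not sure this is the right/idiomatic approach:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("running-aggregation-sketch").getOrCreate()

# Built-in "rate" source: emits rows with `timestamp` and `value` columns.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Derive a grouping key just for illustration.
keyed = events.withColumn("key", F.col("value") % 3)

# A plain groupBy on a streaming DataFrame keeps aggregation state between
# micro-batches, so each emitted result should reflect
# Window_period_1 + Window_period_2 + ... + the newest batch.
running_totals = keyed.groupBy("key").agg(
    F.count("*").alias("events_so_far"),
    F.sum("value").alias("sum_so_far"),
)

query = (
    running_totals.writeStream
    .outputMode("complete")   # emit the full updated aggregate each trigger
    .format("console")
    .start()
)
query.awaitTermination()
```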
Another way of seeing the same thing: let's say I have a requirement where a few DataFrames are already created, df1, df2, and df3, and I need to run an aggregation that involves data from df1, df2, df3 as well as Window_period_1, Window_period_2, and all new incoming streaming data.

How would I do that?
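To make this second scenario concrete, this is the kind of thing I have sketched so far (again with the rate source as a placeholder for the real stream, and made-up schemas for df1/df2/df3). Inside foreachBatch each micro-batch arrives as an ordinary DataFrame, so I can union it with the static frames, but this only combines the static data with the current batch; I don't see how to also fold in the data from Window_period_1 and Window_period_2:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-plus-static-sketch").getOrCreate()

# Pre-existing (static) DataFrames; the schema here is just a placeholder.
df1 = spark.createDataFrame([(0, 10), (1, 20)], ["key", "value"])
df2 = spark.createDataFrame([(1, 30), (2, 40)], ["key", "value"])
df3 = spark.createDataFrame([(2, 50), (0, 60)], ["key", "value"])
historic = df1.union(df2).union(df3)

# Streaming source (rate source stands in for the real feed).
stream = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .select((F.col("value") % 3).alias("key"), F.col("value"))
)

def combine_with_historic(batch_df, batch_id):
    # Inside foreachBatch the micro-batch is a normal DataFrame, so it can
    # be unioned with the static DataFrames and aggregated arbitrarily.
    # This covers df1/df2/df3 + the current batch only, not earlier batches.
    combined = historic.union(batch_df.select("key", "value"))
    combined.groupBy("key").agg(F.sum("value").alias("total")).show()

query = stream.writeStream.foreachBatch(combine_with_historic).start()
query.awaitTermination()
```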