
I am trying to understand Spark Streaming in terms of aggregation principles. Spark DFs are based on mini-batches, and computations are done on the mini-batch that arrived within a specific time window.

Let's say we have data coming in as:

 Window_period_1[Data1, Data2, Data3] Window_period_2[Data4, Data5, Data6] .. 

then the first computation will be done for Window_period_1 and then for Window_period_2. If I need to use the new incoming data along with the historic data, say some kind of groupBy between Window_period_new and the data from Window_period_1 and Window_period_2, how would I do that?

Another way of seeing the same thing: let's say I have a requirement where a few data frames are already created -

df1, df2, df3 - and I need to run an aggregation that involves data from df1, df2, df3 as well as Window_period_1, Window_period_2, and all new incoming streaming data

how would I do that?
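Roughly, the setup I have in mind looks like the sketch below (the key/value schema and the socket source are just made up to make it concrete):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object AggregationQuestionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("agg-question").master("local[2]").getOrCreate()
        import spark.implicits._

        // Already-created (historic) data frames -- contents are made up.
        val df1 = Seq(("k1", 10), ("k2", 20)).toDF("key", "value")
        val df2 = Seq(("k1", 5), ("k3", 7)).toDF("key", "value")
        val df3 = Seq(("k2", 3)).toDF("key", "value")

        // New data arrives in mini-batches: Window_period_1, Window_period_2, ...
        val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
        val incoming = ssc.socketTextStream("localhost", 9999)
          .map { line => val Array(k, v) = line.split(","); (k, v.toInt) }

        // This only sees the current mini-batch; the question is how to group by key
        // over df1 ++ df2 ++ df3 ++ every window seen so far ++ the newest window.
        incoming.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }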

1 Answer


Spark allows you to store state in an RDD (with checkpoints). So, even after a restart, the job will restore its state from the checkpoint and continue streaming.
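For example, here is a minimal sketch of that approach with updateStateByKey and a checkpoint directory; the source, the checkpoint path, and the seeding values are illustrative. It keeps a running total per key across all mini-batches, and the initialRDD overload lets you seed that state from your historic data (e.g. df1/df2/df3 pre-aggregated to one value per key):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulAggregationSketch {
      // Checkpoint directory is illustrative; in practice use a reliable store (HDFS, S3, ...).
      val checkpointDir = "hdfs:///tmp/streaming-checkpoints"

      def createContext(): StreamingContext = {
        val spark = SparkSession.builder.appName("stateful-agg").getOrCreate()
        val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
        ssc.checkpoint(checkpointDir)

        // Seed the state with pre-aggregated historic data (values are made up).
        val historic = spark.sparkContext.parallelize(Seq(("k1", 15), ("k2", 23)))

        val incoming = ssc.socketTextStream("localhost", 9999)
          .map { line => val Array(k, v) = line.split(","); (k, v.toInt) }

        // Running total per key across every mini-batch seen so far.
        val runningTotals = incoming.updateStateByKey(
          (newValues: Seq[Int], state: Option[Int]) => Some(state.getOrElse(0) + newValues.sum),
          new HashPartitioner(spark.sparkContext.defaultParallelism),
          historic)

        runningTotals.print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On a restart the context (and the per-key state) is rebuilt from the checkpoint.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }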

However, we ran into performance problems with checkpointing (especially after restoring state), so it is worth implementing state storage in some external store (like HBase).
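One rough sketch of that approach, assuming each mini-batch has already been reduced to (key, delta) pairs and using the plain HBase client API (the table, column family, and qualifier names are made up):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.streaming.dstream.DStream

    object ExternalStateSketch {
      // Push per-batch deltas into HBase so the running totals live outside
      // Spark's checkpointed state.
      def persistTotals(deltas: DStream[(String, Long)]): Unit = {
        deltas.foreachRDD { rdd =>
          rdd.foreachPartition { partition =>
            // One connection per partition; each executor talks to HBase directly.
            val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
            val table = connection.getTable(TableName.valueOf("stream_state"))
            partition.foreach { case (key, delta) =>
              // Atomic server-side increment keeps the aggregate correct across batches.
              table.incrementColumnValue(
                Bytes.toBytes(key), Bytes.toBytes("agg"), Bytes.toBytes("total"), delta)
            }
            table.close()
            connection.close()
          }
        }
      }
    }

The trade-off is that you now manage consistency yourself (an increment is not idempotent if a batch is replayed), which is part of the pros and cons mentioned in the comment below.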


1 Comment

To bolster Natalia's answer, there are a number of datastores that either connect to or are integrated with Spark and can store state for aggregations (if checkpointing doesn't work well). HBase is one of them. There are also SnappyData, Cassandra, Redis, and MemSQL, to name a few, all with various pros and cons.
