Stream-Stream Join in Spark Structured Streaming

Question

Below is my use case for Spark Structured Streaming:

Step 1: StreamA = loaded from Kafka topicA, containing events of type A

Step 2: StreamB = loaded from Kafka topicB, containing events of type B

Step 3: JoinedStream = StreamA inner join StreamB on id

Step 4: Insert the matched data into a database

I don't need the matched data for further processing. Will Spark clear the join state once rows have been matched?

If not, how do I clear it without a watermark?

Vindhya G · Accepted Answer · 2021-07-09 07:25:19Z

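For an inner join, the watermark and the time-constraint join conditions are optional, as per the docs (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking). Without them, however, Spark does not clear the state after a match: every input row stays buffered, because any future row from the other stream could still match it. The docs say:

To avoid unbounded state, you have to define additional join conditions such that indefinitely old inputs cannot match with future inputs and therefore can be cleared from the state. In other words, you will have to do the following additional steps in the join:

1. Define watermark delays on both inputs, so that the engine knows how delayed the input can be.
2. Define a constraint on event time across the two inputs (a time range join condition, or a join on event-time windows), so that the engine can figure out when old rows of one input will no longer be required for matches with the other input.

Is there any reason not to include a watermark and such conditions to clear the state?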
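For illustration, here is a minimal PySpark sketch of a stream-stream inner join with watermarks on both inputs and a time range condition, which is what allows Spark to evict old rows from the join state. The column names (a_id, a_time, b_id, b_time), the helper names, and the 10 minute and 1 hour thresholds are assumptions made for this example; they are not taken from the question.

```python
def time_range_condition(max_gap: str = "1 hour") -> str:
    """Build a join condition: same id, and B's event time falls
    within max_gap after A's event time (hypothetical column names)."""
    return (
        "a_id = b_id AND "
        "b_time >= a_time AND "
        f"b_time <= a_time + interval {max_gap}"
    )


def join_with_bounded_state(stream_a, stream_b,
                            lateness: str = "10 minutes",
                            max_gap: str = "1 hour"):
    """Inner-join two streaming DataFrames so that old state can be dropped.

    Assumes stream_a has columns (a_id, a_time) and stream_b has
    (b_id, b_time), where a_time and b_time are event-time timestamps.
    """
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql.functions import expr

    # Watermarks tell the engine how late events may arrive on each side.
    a = stream_a.withWatermark("a_time", lateness)
    b = stream_b.withWatermark("b_time", lateness)

    # The time range constraint bounds how long each row can still match.
    return a.join(b, expr(time_range_condition(max_gap)), "inner")
```

The joined stream could then be written to the database with something like joined.writeStream.foreachBatch(write_batch).start(), where write_batch is a user-supplied function that receives each micro-batch DataFrame and its batch id. The right thresholds depend on how far apart matching A and B events can arrive in your data.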
