Spark Structured Streaming - multiple aggregations without re-reading data

Question

I'm investigating using Apache Spark for an application. I'm especially interested in the structured streaming mode using temporary views and full SQL queries (for simplicity and low latency).

The application would require running multiple (tens, possibly hundreds) of queries on a single input stream of data. Is there a way to avoid having Spark re-read the input for each query?

The question is not clear. Multiple aggregations in "single" streaming query should be possible, "if the group key is same". (Just add all aggregations in "agg()".) If you're talking about arbitrary kinds of aggregations like different group keys, sorry but they should be separate streaming queries and input should be re-read. The workaround is the Mike's answer below. The concept of forking/splitting stream don't exist in Spark, whereas other streaming frameworks exist; so your next bet would be trying out Flink to check whether your requirement is satisfied. — Jungtaek Lim
– Jungtaek Lim, Commented Jan 23, 2021 at 3:01

Michael Heil · Accepted Answer · 2021-01-20 06:42:22Z

Multiple streaming queries within the same Spark Structured Streaming application will run concurrently and independent in the sense that they make different progress while reading the same source. Therefore, caching/persisting will not be feasible (and is actually not possible).

Unless you use the following foreachBatch pattern for your streaming queries there is no standard way to cache the input source.

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => batchDF.persist() batchDF.write.format(...).save(...) // location 1 batchDF.write.format(...).save(...) // location 2 batchDF.unpersist() }

More details are given in the Structured Streaming Programming Guide on using foreach and foreachbatch

Just to be clear, if not using foreachbatch, the multiple queries will still use the same microbatch is my understanding, or is that not so?

Collectives™ on Stack Overflow

Spark Structured Streaming - multiple aggregations without re-reading data

1 Answer 1

1 Comment

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Related