
I am trying to see if I can stream data to Redshift using Spark Structured Streaming (v2.2). I found the spark-redshift library (https://github.com/databricks/spark-redshift), but it only works in batch mode. Any other suggestions on how to do this with streaming data? And how does COPY to Redshift perform?

Appreciate it!

1 Answer


For low volumes of data (a few rows arriving occasionally), it is OK to use plain SQL commands such as:

insert into table ...
update table ...
delete from table ...

to maintain the Redshift data. This is likely how a Spark streaming sink would work.
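As a minimal sketch of that row-level approach, assuming Spark 2.4+ (where the foreachBatch sink is available) and a Redshift cluster reachable over the Postgres protocol via psycopg2; the host, credentials, and the events table are placeholders:

    from pyspark.sql import SparkSession
    import psycopg2

    spark = SparkSession.builder.appName("redshift-low-volume").getOrCreate()

    # Stand-in streaming source; replace with Kafka, files, etc.
    stream_df = (spark.readStream
                 .format("rate")
                 .option("rowsPerSecond", 1)
                 .load())

    def insert_rows(batch_df, batch_id):
        # Row-by-row INSERTs: fine for a trickle of data, far too slow for bulk loads.
        rows = batch_df.collect()
        conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                                port=5439, dbname="dev", user="user", password="secret")
        with conn, conn.cursor() as cur:
            for r in rows:
                cur.execute("insert into events (ts, value) values (%s, %s)",
                            (r["timestamp"], r["value"]))
        conn.close()

    (stream_df.writeStream
     .foreachBatch(insert_rows)
     .start()
     .awaitTermination())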

However, for larger volumes you must always:

1) write the data to S3, preferably chunked into 1 MB to 1 GB files, preferably gzipped;
2) run the Redshift COPY command to load that S3 data into a Redshift "staging" area;
3) run Redshift SQL to merge the staging data into your target tables.
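A rough sketch of those three steps, again assuming Spark 2.4+ with foreachBatch, an S3 bucket the cluster can read, and an IAM role with COPY privileges; the bucket, role, connection details, and the events / events_staging tables are all placeholders:

    import psycopg2

    def load_to_redshift(batch_df, batch_id):
        # 1) Write the micro-batch to S3 as gzipped JSON files.
        prefix = "s3a://my-bucket/stream/batch_{}/".format(batch_id)
        (batch_df.coalesce(1)                      # aim for a small number of 1MB-1GB files
                 .write.mode("overwrite")
                 .option("compression", "gzip")
                 .json(prefix))

        conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                                port=5439, dbname="dev", user="user", password="secret")
        with conn, conn.cursor() as cur:
            # 2) COPY the staged S3 files into a staging table.
            cur.execute("create temp table events_staging (like events)")
            cur.execute("""
                copy events_staging
                from 's3://my-bucket/stream/batch_{}/'
                iam_role 'arn:aws:iam::123456789012:role/redshift-copy-role'
                gzip format as json 'auto'
            """.format(batch_id))
            # 3) Merge staging into the target table (delete-then-insert upsert).
            cur.execute("delete from events using events_staging "
                        "where events.id = events_staging.id")
            cur.execute("insert into events select * from events_staging")
        conn.close()

The delete-then-insert pattern in step 3 is the usual way to emulate an upsert in Redshift, which keeps the target table consistent if the same keys appear in later batches.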

Using this COPY method could be hundreds of times more efficient than individual inserts.

This means, of course, that you really have to run in batch mode.

You can run the batch update every few minutes to keep Redshift data latency low.
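One way to do that, assuming Spark 2.4+ and reusing the load_to_redshift function sketched above (the Kafka settings and checkpoint location are placeholders), is to drive the COPY/merge cycle from a processing-time trigger:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("redshift-microbatch").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load())

    query = (events.writeStream
             .foreachBatch(load_to_redshift)        # S3 write + COPY + merge per micro-batch
             .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
             .trigger(processingTime="5 minutes")   # one COPY/merge cycle every few minutes
             .start())

    query.awaitTermination()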


1 Comment

Really great answer. It helped shape my project architecture.
