
I am trying to insert data from S3 (Parquet files) into Redshift. Doing it through SQLWorkbench takes 46 seconds for 6 million rows, but doing it through the spark-redshift connector takes about 7 minutes.

I tried it with more nodes and got the same result.

Any suggestions to improve the time using spark-redshift?

The code in Spark:

val df = spark.read
  .option("basePath", "s3a://parquet/items")
  .parquet("s3a://parquet/items/Year=2017/Month=7/Day=15")

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:....")
  .option("dbtable", "items")
  .option("tempdir", "s3a://parquet/temp")
  .option("aws_iam_role", "...")
  .option("sortkeyspec", "SORTKEY(id)")
  .mode(SaveMode.Append)
  .save()

The code in SQLWorkbench (Redshift SQL):

CREATE EXTERNAL TABLE items_schema.parquet_items ("id type, column2 type....")
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS PARQUET
LOCATION 's3://parquet/items/Year=2017/Month=7/Day=15';

CREATE TABLE items ("id type, column2 type....");

INSERT INTO items (SELECT * FROM items_schema.parquet_items);
  • Check the AWS console for Redshift to see how long the COPY commands are taking via each approach; also check the number of COPY commands generated. Commented Feb 7, 2018 at 11:38
  • Are you using Spectrum to read the parquet as an external table in Redshift? Please provide an example of what you are running in SQLWorkbench and what you are running when you use spark-redshift. Commented Feb 7, 2018 at 14:40
  • @JoeHarris I edited the question and added the code to provide more information. Commented Feb 8, 2018 at 8:32
  • What is the Spark cluster and Redshift cluster configuration? Commented Aug 13, 2019 at 21:30
  • I found myself doing something similar; however, I was advised to write out the Parquet and use the COPY command to read large files. The reasoning was that the Redshift COPY command is optimized to scan data by columns, and compression helps with this efficiency. I do not know the details of how spark-redshift writes. Commented Oct 8, 2019 at 16:27

2 Answers


I would say your snippets are mislabelled:

  • This is Spark code val df = spark.read…
  • This is Redshift SQL CREATE EXTERNAL TABLE…

When you use the external table (Redshift Spectrum) it does the following:

  • Read the parquet data in the location defined.
  • Insert the data into a normal Redshift table as shown.

When you use the Spark code to write the data to Redshift, using spark-redshift, it does the following:

  • Spark reads the parquet files from S3 into the Spark cluster.
  • Spark converts the parquet data to Avro format and writes it to S3.
  • Spark issues a COPY SQL query to Redshift to load the data.
  • Redshift loads the Avro data from S3 to the final table.

Basically the Spark code is doing a lot more work, reading the data twice and writing it twice in different formats. The Redshift Spectrum SQL is reading the data once and writing it once into Redshift itself (much faster than sending to S3 over the network).
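For illustration only (this is not taken from the connector's source), the load step spark-redshift triggers is roughly equivalent to a Redshift COPY over the Avro files it staged in the tempdir; the manifest path and IAM role below are placeholders:

COPY items
FROM 's3://parquet/temp/<staging-prefix>/manifest'   -- hypothetical manifest the connector writes to tempdir
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'   -- placeholder role
FORMAT AS AVRO 'auto'
MANIFEST;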


3 Comments

Sorry for the mislabelling, I have edited it. Thanks for your answer.
Although marked as the answer, I am not comfortable with the explanation of "When you use the external table (Redshift Spectrum) it does the following". Spectrum internally uses a cluster called "spectrum workers", which I suspect is a hidden EMR cluster/Athena running Hive LLAP. So it depends on how the task is distributed and communicated; Redshift may be performing some super awesome optimizations at these two points that the Spark code on EMR may not have been tuned for.
Since this library reads and writes data to S3 when transferring data to/from Redshift, I have seen in some posts that you can optimize the Spark session to work with S3 using some Hadoop configuration, like spark.sparkContext.hadoopConfiguration.set("fs.s3a.fast.upload", "true"). Any suggestions or comments about this?

Also, try using CSV instead of Avro (which is the default); it should be faster:

Redshift is significantly faster when loading CSV than when loading Avro files, so using that tempformat may provide a large performance boost when writing to Redshift.

https://docs.databricks.com/spark/latest/data-sources/aws/amazon-redshift.html
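A minimal sketch of what that would look like from the Spark side, assuming the connector's tempformat option and reusing the placeholder connection options from the question:

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:....")
  .option("dbtable", "items")
  .option("tempdir", "s3a://parquet/temp")
  .option("aws_iam_role", "...")
  // stage the intermediate data as CSV (or "CSV GZIP") instead of the default Avro
  .option("tempformat", "CSV")
  .mode(SaveMode.Append)
  .save()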

1 Comment

Thanks, I also tried it, but it is an experimental option.

