I am trying to load data from S3 (Parquet files) into Redshift. Through SQLWorkbench it takes 46 seconds for 6 million rows, but through the spark-redshift connector it takes about 7 minutes.
I have tried it with more nodes and get the same result.
Any suggestions to improve the time using spark-redshift?
The code in Spark:
val df = spark.read
  .option("basePath", "s3a://parquet/items")
  .parquet("s3a://parquet/items/Year=2017/Month=7/Day=15")

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:....")
  .option("dbtable", "items")
  .option("tempdir", "s3a://parquet/temp")
  .option("aws_iam_role", "...")
  .option("sortkeyspec", "SORTKEY(id)")
  .mode(SaveMode.Append)
  .save()

The code in SQLWorkbench (Redshift SQL):
CREATE EXTERNAL TABLE items_schema.parquet_items (id type, column2 type, ....)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS PARQUET
LOCATION 's3://parquet/items/Year=2017/Month=7/Day=15';

CREATE TABLE items (id type, column2 type, ....);

INSERT INTO items (SELECT * FROM items_schema.parquet_items);
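
In case it helps, below is a variant of the spark-redshift write that I have put together but not benchmarked yet. It only changes the S3 staging format from the default Avro to gzipped CSV via the tempformat option (my understanding of tempformat comes from the spark-redshift README, so treat it as an assumption); everything else is the same as the code above.

// Sketch only: same write as above, but staging gzipped CSV instead of Avro.
// Assumes `spark` is the existing SparkSession and `df` is the DataFrame read above.
import org.apache.spark.sql.SaveMode

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:....")
  .option("dbtable", "items")
  .option("tempdir", "s3a://parquet/temp")
  .option("aws_iam_role", "...")
  .option("sortkeyspec", "SORTKEY(id)")
  .option("tempformat", "CSV GZIP") // stage CSV GZIP in S3 instead of the default Avro (assumption: per the spark-redshift README)
  .mode(SaveMode.Append)
  .save()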