
As I read the Spark/Hive SQL documentation, it appears that INSERT INTO a table with a column list is not supported in Spark 2.4 and earlier versions.

I have a source table and a destination table with different numbers of columns and different column names, and I need to copy data from one to the other.

Does this mean I have to code this in PySpark, since Spark SQL will not be able to do it?

Example:

input_table(cola, colb, colc, cold, cole)
output_table(fieldx, fieldy, fieldz)

In SQL (assuming an RDBMS such as MS-SQL, PostgreSQL, etc.) I would do the following:

insert into output_table(fieldx, fieldy, fieldz) select cola, colb, colc from input_table 

Spark SQL does not allow this; it does not accept a column list in the INSERT statement.

Question: how can I do this task with a minimum of code and maximum performance, in either PySpark or (ideally) Spark SQL? (I am using Spark 2.4.)

thank you

Is this something that was updated in a later version of Spark (> 2.4), where you can specify columns? Not supporting this basic SQL syntax seems pretty poor... Commented Jan 23, 2020 at 18:38

1 Answer


In the SELECT, specify the output columns that won't be copied from input_table as null. (This is what would happen to those columns anyway if inserting with a column list covering only a subset of the columns were allowed.)

insert into output_table
select
  cola,
  colb,
  colc,
  null as other1  -- specify non-copied column values as null
from input_table
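If you would rather do this in PySpark, a minimal sketch of the same idea follows. It assumes spark is an existing SparkSession and reuses the table and column names from the question; it is an illustration, not code from the original post:

from pyspark.sql import functions as F

# Select only the columns to copy, aliased to the destination names.
# insertInto matches columns by position, so the select order must
# match the column order of output_table.
df = (spark.table("input_table")
      .select(
          F.col("cola").alias("fieldx"),
          F.col("colb").alias("fieldy"),
          F.col("colc").alias("fieldz")))

df.write.insertInto("output_table")

Both routes go through the same Catalyst optimizer, so there is no inherent performance difference between the SQL and DataFrame versions.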

1 Comment

Thanks, this is what I started to do. Will this perform optimally? I need to use some SQL functions/expressions on some of the select columns to transform them, such as uppercasing some values, extracting sub-strings, casting other columns, etc. The data source is huge: many billions of rows.
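For the transformations mentioned in that comment (uppercasing, sub-strings, casts), the expressions can go straight into the same select. A hedged sketch follows; which column gets which treatment, the substring length, and the bigint cast are all illustrative assumptions, not details from the original post:

from pyspark.sql import functions as F

# Illustrative per-column transformations; the pairing of columns to
# transformations is an assumption for the sake of the example.
df = (spark.table("input_table")
      .select(
          F.upper(F.col("cola")).alias("fieldx"),             # uppercase
          F.substring(F.col("colb"), 1, 10).alias("fieldy"),  # sub-string
          F.col("colc").cast("bigint").alias("fieldz")))      # cast

df.write.insertInto("output_table")

These are narrow, per-row transformations, so they add no shuffles; even over billions of rows the copy remains a single scan-and-write pass.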
