
I am trying to unify two distinct notebooks into a single one by adapting the construction logic of two different tables.

The first one reads:

spark.sql('''
    SELECT CD_CLI,
           MAX(VL_RPTD_UTZO) AS MAX_VL_RPTD_UTZO,
           '2017-01-31' AS DT_MVTC
    FROM vl_rptd_utzo
    WHERE DT_EXTC BETWEEN '2016-07-31' AND '2016-12-31'
    GROUP BY CD_CLI
''').createOrReplaceTempView('vl_rptd_max_utzo_2017_01_31')

And the second one:

spark.sql('''
    SELECT CD_CLI,
           CASE WHEN SUM(in_lim_crt) > 0
                THEN ROUND(SUM(SUM_VL_TTL_FAT) / SUM(in_lim_crt), 4)
                ELSE -99999999999
           END AS VL_MED_FAT,
           '2017-01-31' AS DT_MVTC
    FROM in_lim_fat
    WHERE DT_MVTC BETWEEN '2016-07-31' AND '2016-12-31'
    GROUP BY CD_CLI
''').createOrReplaceTempView('media_vl_fatura_2017_01_31')

My (perhaps naive?) approach was to UNION the two selects, since they pull the same kind of fields from the same origin:

spark.sql('''
    SELECT CD_CLI,
           CASE WHEN SUM(in_lim_crt) > 0
                THEN ROUND(SUM(SUM_VL_TTL_FAT) / SUM(in_lim_crt), 4)
                ELSE -99999999999
           END AS VL_MED_FAT,
           '2017-01-31' AS DT_MVTC
    FROM in_lim_fat
    WHERE DT_MVTC BETWEEN '2016-07-31' AND '2016-12-31'
    GROUP BY CD_CLI
    UNION
    SELECT CD_CLI,
           MAX(VL_RPTD_UTZO) AS MAX_VL_RPTD_UTZO,
           '2017-01-31' AS DT_MVTC
    FROM vl_rptd_utzo
    WHERE DT_EXTC BETWEEN '2016-07-31' AND '2016-12-31'
    GROUP BY CD_CLI
''').createOrReplaceTempView('new_table')

But when I ask for a describe:

spark.sql('describe new_table').show(10, False) 

The output is:

+----------+-------------+-------+
|col_name  |data_type    |comment|
+----------+-------------+-------+
|CD_CLI    |int          |null   |
|VL_MED_FAT|decimal(38,4)|null   |
|DT_MVTC   |string       |null   |
+----------+-------------+-------+

Why is MAX_VL_RPTD_UTZO not showing up in the new table? I am new to SQL, and maybe this is very naive and simple, but I cannot solve it.

Comments:
  • You SELECT 3 columns, and get 3 columns, just as expected. (The column names are taken from the first SELECT.) Commented Oct 14, 2021
  • Do SELECT c1, c2, c3, NULL etc. to get 4 columns. And perhaps you want UNION ALL? Commented Oct 14, 2021

1 Answer

Your first select has CD_CLI, VL_MED_FAT and DT_MVTC

Your second select has CD_CLI, MAX_VL_RPTD_UTZO and DT_MVTC

Spark uses the column names of the first query as the schema of the union and applies it to the subsequent queries, so the values of MAX_VL_RPTD_UTZO end up in the VL_MED_FAT column.
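This is standard SQL behaviour, not something Spark-specific. A minimal sketch using Python's built-in sqlite3 with two hypothetical toy tables (Spark resolves UNION column names the same way):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Toy stand-ins for the two aggregated views in the question.
con.execute("CREATE TABLE media_vl (CD_CLI INT, VL_MED_FAT REAL)")
con.execute("CREATE TABLE max_utzo (CD_CLI INT, MAX_VL_RPTD_UTZO REAL)")
con.execute("INSERT INTO media_vl VALUES (1, 0.25)")
con.execute("INSERT INTO max_utzo VALUES (1, 900.0)")

cur = con.execute("""
    SELECT CD_CLI, VL_MED_FAT FROM media_vl
    UNION
    SELECT CD_CLI, MAX_VL_RPTD_UTZO FROM max_utzo
""")
# Output schema comes from the FIRST select only.
print([d[0] for d in cur.description])  # no MAX_VL_RPTD_UTZO column
print(cur.fetchall())                   # 900.0 sits in the VL_MED_FAT column
```

Both 0.25 and 900.0 come back under the single VL_MED_FAT header, which is exactly why describe showed three columns.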

Edit #1: If you want to have 4 columns, then the column lists must be consistent between the 2 queries, so something like this

select CD_CLI, VL_MED_FAT, null as MAX_VL_RPTD_UTZO, DT_MVTC from ...
union
select CD_CLI, null as VL_MED_FAT, MAX_VL_RPTD_UTZO, DT_MVTC from ...
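A runnable sketch of this null-padding pattern, again using sqlite3 with hypothetical toy tables in place of the two aggregated views (the same SQL works in spark.sql):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE media_vl (CD_CLI INT, VL_MED_FAT REAL, DT_MVTC TEXT)")
con.execute("CREATE TABLE max_utzo (CD_CLI INT, MAX_VL_RPTD_UTZO REAL, DT_MVTC TEXT)")
con.execute("INSERT INTO media_vl VALUES (1, 0.25, '2017-01-31')")
con.execute("INSERT INTO max_utzo VALUES (1, 900.0, '2017-01-31')")

# Pad each side with NULL for the column it does not produce, so both
# selects expose the same 4 columns and the union lines up correctly.
cur = con.execute("""
    SELECT CD_CLI, VL_MED_FAT, NULL AS MAX_VL_RPTD_UTZO, DT_MVTC FROM media_vl
    UNION
    SELECT CD_CLI, NULL AS VL_MED_FAT, MAX_VL_RPTD_UTZO, DT_MVTC FROM max_utzo
""")
print([d[0] for d in cur.description])  # all 4 columns now appear
for row in cur.fetchall():
    print(row)
```

Note that each CD_CLI now appears on two rows, one from each source, with NULL in the other measure; collapsing them to one row per client would need a further GROUP BY (or a join) on top of this union.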

4 Comments

I see. Can I have the values of MAX_VL_RPTD_UTZO not in VL_MED_FAT but appearing independently? I mean, can I have 4 columns by using another query method other than UNION?
Yes of course, check my updated answer
ok!!! thanks! I didn't know I could use these "null AS ..." aliases to construct the table. So it was as simple as I thought. Thanks a lot.
well you used '2017-01-31' as DT_MVTC to construct the table, so it's essentially the same :)
