
I can use neither PySpark nor Scala; I can only write SQL code. I have a table with two columns, item_id and name.

item_id  name
1        name1
1        name2
1        name3
2        name4
2        name5

I want to generate a result where all the names for an item_id are concatenated:

item_id  names
1        name1-name2-name3
2        name4-name5

How do I create such a table with Spark SQL?


3 Answers


The beauty of Spark SQL is that once you have a solution in any of the supported languages (Scala, Java, Python, R, or SQL), you can fairly easily translate it into the others.

The following SQL statement seems to do what you ask for:

SELECT item_id, array_join(collect_list(name), '-') AS names
FROM tableName
GROUP BY item_id

In spark-shell it gives the following result:

scala> sql("select item_id, array_join(collect_list(name), '-') as names from so group by item_id").show
+-------+-----------------+
|item_id|            names|
+-------+-----------------+
|      1|name1-name2-name3|
|      2|      name4-name5|
+-------+-----------------+
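Since the question asks how to create a table rather than just a query result, here is a minimal pure-SQL sketch, assuming the source table is named so as in the shell session above (item_names is a hypothetical name for the new table):

CREATE TABLE item_names AS  -- item_names is a hypothetical target table name
SELECT item_id, array_join(collect_list(name), '-') AS names
FROM so
GROUP BY item_id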



You can try the PySpark code below:

from pyspark.sql.functions import array_join, collect_list

(
    df.orderBy('name', ascending=False)   # sort rows before grouping
      .groupBy('item_id')
      .agg(
          array_join(
              collect_list('name'),       # gather the names of each group into an array
              delimiter='-',              # join them with a hyphen
          ).alias('names')
      )
)
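Note that collect_list does not guarantee that the sort order survives the shuffle introduced by the grouping. If a deterministic (here descending) order matters and only SQL is available, a sketch using sort_array, again assuming the table is named so:

-- sort_array with ascendingOrder = false sorts each collected array descending
SELECT item_id,
       array_join(sort_array(collect_list(name), false), '-') AS names
FROM so
GROUP BY item_id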

Comment: I can only use SQL.

You can use the Spark DataFrame's groupBy and agg methods together with the concat_ws function:

import org.apache.spark.sql.functions.{collect_list, concat_ws}
import spark.implicits._  // for the $"..." column syntax; pre-imported in spark-shell

df.groupBy($"item_id").agg(concat_ws("-", collect_list($"name")).alias("names")).show()

This groups the rows by item_id and aggregates each group's name values by concatenating them with a hyphen.
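Since the asker can only use SQL, the same concat_ws approach can also be expressed in pure Spark SQL; a sketch, again assuming the table is named so:

-- concat_ws joins the collected names with '-' directly in SQL
SELECT item_id, concat_ws('-', collect_list(name)) AS names
FROM so
GROUP BY item_id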

Comment: I can only use SQL.
