How can I split an array of structs into columns in Spark?

Question

I have an column containing an array of phone numbers represented as structs, and need to put them into three columns by a "type" attribute (phone1, phone2, fax).

Here are two sample values of the column.

[{"number":"000-000-0000","type":"Phone1"},{"number":"000-000-0001","type":"Phone2"},{"number":"000-000-0002","type":"Fax"}] [{"number":"000-000-1000","type":"Phone1"},{"number":"000-000-1001","typeCode":"Fax"},{"number":"000-000-1002","type":"Phone2"}]

I want to split each into three columns, one for each type. I want something like this:

 Phone1 Phone2 Fax 000-000-0000 000-000-0001 000-000-0002 000-000-1000 000-000-1002 000-000-1001

This answer shows how to put each element of the array into its own column. How to explode an array into multiple columns in Spark

This gets me halfway there, but I can't rely on the order of items in the array. If I do this I'll get something like this, where the Phone2 and Fax values in the second column are out of place.

 Phone1 Phone2 Fax 000-000-0000 000-000-0001 000-000-0002 000-000-1000 000-000-1001 000-000-1002

How can I split the single column value into three columns, using the type value? An array can have 0-3 numbers, but will never have more than one number of each type.

Leo C · Accepted Answer · 2019-04-04 23:58:53Z

Here's one way which involves flattening the phone/fax#s via explode, followed by pivoting on the typeCode, as shown in the following example:

case class Contact(number: String, typeCode: String) val df = Seq( (1, Seq(Contact("111-22-3333", "Phone1"), Contact("111-44-5555", "Phone2"), Contact("111-66-7070", "Fax"))), (2, Seq(Contact("222-33-4444", "Phone1"), Contact("222-55-6060", "Fax"), Contact("111-77-8888", "Phone2"))) ).toDF("user_id", "contacts") df. withColumn("contact", explode($"contacts")). groupBy($"user_id").pivot($"contact.typeCode").agg(first($"contact.number")). show(false) // +-------+-----------+-----------+-----------+ // |user_id|Fax |Phone1 |Phone2 | // +-------+-----------+-----------+-----------+ // |1 |111-66-7070|111-22-3333|111-44-5555| // |2 |222-55-6060|222-33-4444|111-77-8888| // +-------+-----------+-----------+-----------+

What if the DF contains other columns, which you want to keep? One can add them into the groupBy clause, but that doesn't work with unorderable types like a MapType.. The only solution I can think of involves joining those columns back to the pivoted DF.. all fairly expensive operations..

Collectives™ on Stack Overflow

How can I split an array of structs into columns in Spark?

1 Answer 1

1 Comment

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Linked

Related