1

I have an column containing an array of phone numbers represented as structs, and need to put them into three columns by a "type" attribute (phone1, phone2, fax).

Here are two sample values of the column.

[{"number":"000-000-0000","type":"Phone1"},{"number":"000-000-0001","type":"Phone2"},{"number":"000-000-0002","type":"Fax"}] [{"number":"000-000-1000","type":"Phone1"},{"number":"000-000-1001","typeCode":"Fax"},{"number":"000-000-1002","type":"Phone2"}] 

I want to split each into three columns, one for each type. I want something like this:

 Phone1 Phone2 Fax 000-000-0000 000-000-0001 000-000-0002 000-000-1000 000-000-1002 000-000-1001 

This answer shows how to put each element of the array into its own column. How to explode an array into multiple columns in Spark

This gets me halfway there, but I can't rely on the order of items in the array. If I do this I'll get something like this, where the Phone2 and Fax values in the second column are out of place.

 Phone1 Phone2 Fax 000-000-0000 000-000-0001 000-000-0002 000-000-1000 000-000-1001 000-000-1002 

How can I split the single column value into three columns, using the type value? An array can have 0-3 numbers, but will never have more than one number of each type.

1 Answer 1

1

Here's one way which involves flattening the phone/fax#s via explode, followed by pivoting on the typeCode, as shown in the following example:

case class Contact(number: String, typeCode: String) val df = Seq( (1, Seq(Contact("111-22-3333", "Phone1"), Contact("111-44-5555", "Phone2"), Contact("111-66-7070", "Fax"))), (2, Seq(Contact("222-33-4444", "Phone1"), Contact("222-55-6060", "Fax"), Contact("111-77-8888", "Phone2"))) ).toDF("user_id", "contacts") df. withColumn("contact", explode($"contacts")). groupBy($"user_id").pivot($"contact.typeCode").agg(first($"contact.number")). show(false) // +-------+-----------+-----------+-----------+ // |user_id|Fax |Phone1 |Phone2 | // +-------+-----------+-----------+-----------+ // |1 |111-66-7070|111-22-3333|111-44-5555| // |2 |222-55-6060|222-33-4444|111-77-8888| // +-------+-----------+-----------+-----------+ 
Sign up to request clarification or add additional context in comments.

1 Comment

What if the DF contains other columns, which you want to keep? One can add them into the groupBy clause, but that doesn't work with unorderable types like a MapType.. The only solution I can think of involves joining those columns back to the pivoted DF.. all fairly expensive operations..

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.