
I have a DataFrame of this form:

+---+-----------------------------------+
|ID |ESTRUC_COMP                        |
+---+-----------------------------------+
|4A |{'AP': '201', 'BQ': '2'}           |
|8B |{'AP': '501', 'BQ': '1', 'IN': '5'}|
+---+-----------------------------------+

And I need something like this:

+---+-----------------------------------+---+---+---+
|ID |ESTRUC_COMP                        |AP |BQ |IN |
+---+-----------------------------------+---+---+---+
|4A |{'AP': '201', 'BQ': '2'}           |201|2  |   |
|8B |{'AP': '501', 'BQ': '1', 'IN': '5'}|501|1  |5  |
+---+-----------------------------------+---+---+---+

But ESTRUC_COMP is a string:

root
 |-- ID: string (nullable = true)
 |-- ESTRUC_COMP: string (nullable = true)
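For reference, here is a minimal way to reproduce this DataFrame (assuming an existing SparkContext named sc):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`
df = sqlContext.createDataFrame(
    [("4A", "{'AP': '201', 'BQ': '2'}"),
     ("8B", "{'AP': '501', 'BQ': '1', 'IN': '5'}")],
    ["ID", "ESTRUC_COMP"]
)
df.printSchema()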

How can I perform this transformation? Thank you in advance.

Boris

  • Maybe it is possible to transform the column to struct type and then apply the solution in stackoverflow.com/questions/47874037/… Commented Jul 30, 2018 at 16:07
  • What version of spark? Commented Jul 30, 2018 at 16:42
  • Hi @pault, 1.6.1 Commented Jul 30, 2018 at 17:30
  • Don't get it... I don't have a dict in the column, I have a String column. I need to transform it to a dict, or try to do the work using the String. Commented Jul 30, 2018 at 21:09

1 Answer


Since you're using Spark 1.6, you can't use pyspark.sql.functions.from_json() (it wasn't added until Spark 2.1); you're going to have to use a udf.

This question is very similar to PySpark “explode” dict in column, but I accept it's not a dupe for 2 reasons:

  1. Your string column is not valid JSON (because of the single quotes; see the snippet after this list)

  2. You want the keys to become columns
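To illustrate the first point, json.loads() rejects single-quoted strings outright:

import json

json.loads("{'AP': '201', 'BQ': '2'}")   # raises ValueError (JSONDecodeError on Python 3)
json.loads('{"AP": "201", "BQ": "2"}')   # works: {'AP': '201', 'BQ': '2'}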

Nevertheless, the first step is basically to follow the same steps as in the linked post, with a minor tweak to the parse() function so that it replaces single quotes with double quotes:

from pyspark.sql.functions import udf, explode, first
from pyspark.sql.types import *
import json

def parse(s):
    try:
        # Make the string valid JSON by swapping single quotes for double quotes
        return json.loads(s.replace("'", '"'))
    except ValueError:
        # Catch ValueError rather than json.JSONDecodeError: the latter only
        # exists on Python 3 (where it subclasses ValueError), so this works
        # on both. Unparseable strings come back as null.
        pass

parse_udf = udf(parse, MapType(StringType(), StringType()))

Now you can parse the string and call pyspark.sql.functions.explode():

df.select("ID", explode(parse_udf("ESTRUC_COMP"))).show() #+---+---+-----+ #| ID|key|value| #+---+---+-----+ #| 4A| BQ| 2| #| 4A| AP| 201| #| 8B| IN| 5| #| 8B| BQ| 1| #| 8B| AP| 501| #+---+---+-----+ 

Finally, use pivot() to turn the keys into columns. You can use first() as the aggregate function, because we know the key-value relationship is one-to-one for each ID.

df.select("*", explode(parse_udf("ESTRUC_COMP")))\ .groupBy("ID","ESTRUC_COMP").pivot("key").agg(first("value")).show(truncate=False) #+---+-----------------------------------+---+---+----+ #|ID |ESTRUC_COMP |AP |BQ |IN | #+---+-----------------------------------+---+---+----+ #|4A |{'AP': '201', 'BQ': '2'} |201|2 |null| #|8B |{'AP': '501', 'BQ': '1', 'IN': '5'}|501|1 |5 | #+---+-----------------------------------+---+---+----+ 

Of course, since I defined the udf to return MapType(StringType(), StringType()), all of the resulting columns will be strings. You can either cast them afterwards or modify the udf accordingly.
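For example, a minimal sketch of the casting approach, assuming the pivoted columns are AP, BQ and IN as above and that integers are the desired type:

from pyspark.sql.functions import col

pivoted = df.select("*", explode(parse_udf("ESTRUC_COMP")))\
    .groupBy("ID", "ESTRUC_COMP")\
    .pivot("key")\
    .agg(first("value"))

# Cast each pivoted string column to int; nulls stay null
result = pivoted.select(
    "ID",
    "ESTRUC_COMP",
    *[col(c).cast("int").alias(c) for c in ["AP", "BQ", "IN"]]
)
result.printSchema()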
