
I have a DataFrame of this form:

+---+-----------------------------------+
|ID |ESTRUC_COMP                        |
+---+-----------------------------------+
|4A |{'AP': '201', 'BQ': '2'}           |
|8B |{'AP': '501', 'BQ': '1', 'IN': '5'}|
+---+-----------------------------------+

And I need something like this:

+---+-----------------------------------+---+---+---+
|ID |ESTRUC_COMP                        |AP |BQ |IN |
+---+-----------------------------------+---+---+---+
|4A |{'AP': '201', 'BQ': '2'}           |201|2  |   |
|8B |{'AP': '501', 'BQ': '1', 'IN': '5'}|501|1  |5  |
+---+-----------------------------------+---+---+---+

But ESTRUC_COMP is a string:

root
 |-- ID: string (nullable = true)
 |-- ESTRUC_COMP: string (nullable = true)
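For reference, here is a minimal way to reproduce this DataFrame (assuming an existing SparkContext named sc):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`
df = sqlContext.createDataFrame(
    [("4A", "{'AP': '201', 'BQ': '2'}"),
     ("8B", "{'AP': '501', 'BQ': '1', 'IN': '5'}")],
    ["ID", "ESTRUC_COMP"]
)
df.printSchema()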

How can I perform this transformation? Thank you in advance.

Boris

  • Maybe it is possible to transform the column to struct type and then apply the solution in stackoverflow.com/questions/47874037/… Commented Jul 30, 2018 at 16:07
  • What version of spark? Commented Jul 30, 2018 at 16:42
  • Hi @pault, 1.6.1 Commented Jul 30, 2018 at 17:30
  • Don't get it... I don't have a dict in the column, I have a String column. I need to transform it to a dict, or try to do the work using the String. Commented Jul 30, 2018 at 21:09

1 Answer


Since you're using Spark 1.6, you can't use pyspark.sql.functions.from_json() (it wasn't added until Spark 2.1); you're going to have to use a udf.

This question is very similar to PySpark “explode” dict in column, but I accept it's not a dupe for 2 reasons:

  1. Your string column is not valid JSON (because of the single quotes; see the snippet after this list)

  2. You want the keys to become columns
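To illustrate the first point, json.loads() rejects single-quoted strings outright:

import json

json.loads("{'AP': '201', 'BQ': '2'}")   # raises ValueError (JSONDecodeError on Python 3)
json.loads('{"AP": "201", "BQ": "2"}')   # works: {'AP': '201', 'BQ': '2'}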

Nevertheless, the first step is basically to follow the same steps as in the linked post, with a minor tweak to the parse() function so that it replaces single quotes with double quotes:

from pyspark.sql.functions import udf, explode, first
from pyspark.sql.types import *
import json

def parse(s):
    try:
        # Make the string valid JSON by swapping single quotes for double quotes
        return json.loads(s.replace("'", '"'))
    except ValueError:
        # Catch ValueError rather than json.JSONDecodeError: the latter only
        # exists on Python 3 (where it subclasses ValueError), so this works
        # on both. Unparseable strings come back as null.
        pass

parse_udf = udf(parse, MapType(StringType(), StringType()))

Now you can parse the string and call pyspark.sql.functions.explode():

df.select("ID", explode(parse_udf("ESTRUC_COMP"))).show() #+---+---+-----+ #| ID|key|value| #+---+---+-----+ #| 4A| BQ| 2| #| 4A| AP| 201| #| 8B| IN| 5| #| 8B| BQ| 1| #| 8B| AP| 501| #+---+---+-----+ 

Finally, use pivot() to turn the keys into columns. You can use first() as the aggregate function, because we know the key-value relationship is one-to-one for each ID.

df.select("*", explode(parse_udf("ESTRUC_COMP")))\ .groupBy("ID","ESTRUC_COMP").pivot("key").agg(first("value")).show(truncate=False) #+---+-----------------------------------+---+---+----+ #|ID |ESTRUC_COMP |AP |BQ |IN | #+---+-----------------------------------+---+---+----+ #|4A |{'AP': '201', 'BQ': '2'} |201|2 |null| #|8B |{'AP': '501', 'BQ': '1', 'IN': '5'}|501|1 |5 | #+---+-----------------------------------+---+---+----+ 

Of course, since I defined the udf to return MapType(StringType(), StringType()), all of the resulting columns will be strings. You can either cast them afterwards or modify the udf accordingly.
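For example, a minimal sketch of the casting approach, assuming the pivoted columns are AP, BQ and IN as above and that integers are the desired type:

from pyspark.sql.functions import col

pivoted = df.select("*", explode(parse_udf("ESTRUC_COMP")))\
    .groupBy("ID", "ESTRUC_COMP")\
    .pivot("key")\
    .agg(first("value"))

# Cast each pivoted string column to int; nulls stay null
result = pivoted.select(
    "ID",
    "ESTRUC_COMP",
    *[col(c).cast("int").alias(c) for c in ["AP", "BQ", "IN"]]
)
result.printSchema()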
