1

I have a data frame that looks like so:

>>> l = [('a', 'foo', 1), ('b', 'bar', 1), ('a', 'biz', 6), ('c', 'bar', 3), ('c', 'biz', 2)] >>> df = spark.createDataFrame(l, ('uid', 'code', 'level')) >>> df.show() +---+----+-----+ |uid|code|level| +---+----+-----+ | a| foo| 1| | b| bar| 1| | a| biz| 6| | c| bar| 3| | c| biz| 2| +---+----+-----+ 

What I'm trying to do is group the code and level values into a list of dict and dump that list as a JSON string so that I can save the data frame to disk. The result would look like:

>>> df.show() +---+--------------------------+ |uid| json | +---+--------------------------+ | a| '[{"foo":1}, {"biz":6}]' | | b| '[{"bar":1}]' | | c| '[{"bar":3}, {"biz":2}]' | +---+--------------------------+ 

I'm still pretty new to use PySpark and I'm having a lot of trouble figuring out how to get this result. I almost surely need a groupBy and I've tried implementing this by creating a new StringType column called "json" and then using the pandas_udf decorator but I'm getting errors about unhasable types, because, as I've found out, the way I'm accessing the data is accessing the whole column, not just the row.

>>> df = df.withColumn('json', F.list('')) >>> schema = df.schema >>> @pandas_udf(schema, F.PandasUDFType.GROUPED_MAP) ..: def to_json(pdf): ..: return pdf.assign(serial=json.dumps({pdf.code:pdf.level})) 

I've considered using string concatenation between the two columns and using collect_set but that feels wrong as well since it has the potential to write to disk that which can't be JSON loaded just because it has a string representation. Any help is appreciated.

1 Answer 1

5

There's no need for a pandas_udf in this case. to_json, collect_list and create_map should be all you need:

import pyspark.sql.functions as f df.groupby('uid').agg( f.to_json( f.collect_list( f.create_map('code', 'level') ) ).alias('json') ).show(3, False) +---+---------------------+ |uid|json | +---+---------------------+ |c |[{"bar":3},{"biz":2}]| |b |[{"bar":1}] | |a |[{"foo":1},{"biz":6}]| +---+---------------------+ 
Sign up to request clarification or add additional context in comments.

1 Comment

what if using only one column here i.e. level , i want to give me own structure to json ? For example , [ {first : 3, status : null} , {second : 2, status : "pending"} ] and keep uid column as it is.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.