I have a data frame that looks like so:
>>> l = [('a', 'foo', 1), ('b', 'bar', 1), ('a', 'biz', 6), ('c', 'bar', 3), ('c', 'biz', 2)]
>>> df = spark.createDataFrame(l, ('uid', 'code', 'level'))
>>> df.show()
+---+----+-----+
|uid|code|level|
+---+----+-----+
|  a| foo|    1|
|  b| bar|    1|
|  a| biz|    6|
|  c| bar|    3|
|  c| biz|    2|
+---+----+-----+

What I'm trying to do is group the code and level values for each uid into a list of dicts and dump that list as a JSON string, so that I can save the data frame to disk. The result would look like:
>>> df.show()
+---+--------------------------+
|uid|           json           |
+---+--------------------------+
|  a| '[{"foo":1}, {"biz":6}]' |
|  b| '[{"bar":1}]'            |
|  c| '[{"bar":3}, {"biz":2}]' |
+---+--------------------------+

I'm still pretty new to PySpark and I'm having a lot of trouble figuring out how to get this result. I almost surely need a groupBy, and I've tried implementing it by adding a new StringType column called "json" and then using the pandas_udf decorator. But I'm getting errors about unhashable types because, as I've found out, I'm accessing whole columns rather than individual rows:
>>> df = df.withColumn('json', F.list(''))
>>> schema = df.schema
>>> @pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
..: def to_json(pdf):
..:     return pdf.assign(serial=json.dumps({pdf.code: pdf.level}))

I've also considered concatenating the two columns as strings and using collect_set, but that feels wrong as well, since it could write something to disk that can't be loaded back with json.loads just because it happens to have a string representation.
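To show more concretely what I mean by accessing rows instead of columns, here is roughly the shape I imagine the grouped pandas_udf needs to take: build one {code: level} dict per row within each uid group, then dump the whole list once per group. This is an untested sketch (it assumes Spark 2.3+ with the GROUPED_MAP API; to_json_udf and out_schema are just placeholder names I made up):

import json

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical output schema: one row per uid holding the serialized JSON.
out_schema = StructType([
    StructField('uid', StringType()),
    StructField('json', StringType()),
])

@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def to_json_udf(pdf):
    # pdf holds every row for a single uid, so build one {code: level}
    # dict per row and dump the whole list as a single JSON string.
    # int() casts numpy integers to plain ints so json.dumps accepts them.
    dicts = [{code: int(level)} for code, level in zip(pdf['code'], pdf['level'])]
    return pd.DataFrame({'uid': [pdf['uid'].iloc[0]],
                         'json': [json.dumps(dicts)]})

result = df.groupby('uid').apply(to_json_udf)

I'm not sure whether this is the idiomatic way to do it, or whether a built-in function could produce the JSON without a Python UDF at all. Any help is appreciated.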