I have a data frame that looks like so:
>>> l = [('a', 'foo', 1), ('b', 'bar', 1), ('a', 'biz', 6), ('c', 'bar', 3), ('c', 'biz', 2)]
>>> df = spark.createDataFrame(l, ('uid', 'code', 'level'))
>>> df.show()
+---+----+-----+
|uid|code|level|
+---+----+-----+
|  a| foo|    1|
|  b| bar|    1|
|  a| biz|    6|
|  c| bar|    3|
|  c| biz|    2|
+---+----+-----+

What I'm trying to do is group the code and level values for each uid into a list of dicts and dump that list as a JSON string, so that I can save the data frame to disk. The result would look like:
>>> df.show()
+---+--------------------------+
|uid|           json           |
+---+--------------------------+
|  a| '[{"foo":1}, {"biz":6}]' |
|  b| '[{"bar":1}]'            |
|  c| '[{"bar":3}, {"biz":2}]' |
+---+--------------------------+

I'm still pretty new to PySpark and I'm having a lot of trouble figuring out how to get this result. I almost surely need a groupBy, and I've tried implementing it by adding a new StringType column called "json" and then using the pandas_udf decorator. But I'm getting errors about unhashable types because, as I've found out, I'm accessing whole columns rather than individual rows:
>>> df = df.withColumn('json', F.list(''))
>>> schema = df.schema
>>> @pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
..: def to_json(pdf):
..:     return pdf.assign(serial=json.dumps({pdf.code: pdf.level}))

I've also considered concatenating the two columns as strings and using collect_set, but that feels wrong as well, since it could write something to disk that can't be loaded back with json.loads just because it happens to have a string representation.
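To show more concretely what I mean by accessing rows instead of columns, here is roughly the shape I imagine the grouped pandas_udf needs to take: build one {code: level} dict per row within each uid group, then dump the whole list once per group. This is an untested sketch (it assumes Spark 2.3+ with the GROUPED_MAP API; to_json_udf and out_schema are just placeholder names I made up):

import json

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical output schema: one row per uid holding the serialized JSON.
out_schema = StructType([
    StructField('uid', StringType()),
    StructField('json', StringType()),
])

@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def to_json_udf(pdf):
    # pdf holds every row for a single uid, so build one {code: level}
    # dict per row and dump the whole list as a single JSON string.
    # int() casts numpy integers to plain ints so json.dumps accepts them.
    dicts = [{code: int(level)} for code, level in zip(pdf['code'], pdf['level'])]
    return pd.DataFrame({'uid': [pdf['uid'].iloc[0]],
                         'json': [json.dumps(dicts)]})

result = df.groupby('uid').apply(to_json_udf)

I'm not sure whether this is the idiomatic way to do it, or whether a built-in function could produce the JSON without a Python UDF at all. Any help is appreciated.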