
I have a JSON line as follows:

{"test":"valid2","workflowId":79370,"email":"[email protected]"}{"email":"[email protected]","eventName":"emailOpen","dataFields":{"campaignId":1125010,"ip":"100.100.200.243","userAgentDevice":"Gmail","messageId":"be4e071c11594bb0b4ee3c444fd08b99","emailId":"[email protected]","userAgent":"Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)","workflowName":"DWH TEST 06042020 WITH CALL","locale":null,"templateId":1576122,"emailSubject":"DWH TEST","labels":[],"createdAt":"2020-04-06 15:06:16 +00:00","templateName":"DWH TEST","messageTypeId":27043,"experimentId":79413,"campaignName":"DWH Test Automation","workflowId":79370,"email":"[email protected]","channelId":24365}}{"email":"[email protected]","eventName":"emailOpen","dataFields":{"campaignId":1100,"ip":"50.100.200.243","userAgentDevice":"Gmail","messageId":"zz4e071c11594bb0b4ee3c444fd08b99","emailId":"[email protected]","userAgent":"Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)","workflowName":"TEST","locale":null,"templateId":1576122,"emailSubject":"TEST","labels":"Cambbridge test","createdAt":"2020-04-10 15:06:16 +00:00","templateName":"TEST","messageTypeId":27043,"experimentId":89413,"campaignName":"Cambridge Test","workflowId":18370,"email":"[email protected]","channelId":1111}}{"email":"[email protected]","eventName":"emailClick","dataFields":{"campaignId":1100,"ip":"50.100.200.243","userAgentDevice":"Gmail","messageId":"zzee071c11594bb0b4ee3c444fd08b99","emailId":"[email protected]","userAgent":"Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)","workflowName":"TEST","locale":null,"templateId":1576122,"emailSubject":"TEST","labels":"Cambbridge test","createdAt":"2020-04-10 15:08:16 +00:00","templateName":"TEST","messageTypeId":27043,"experimentId":89413,"campaignName":"Cambridge Test","workflowId":18370,"email":"[email protected]","channelId":1111}}{"test":"valid2","workflowId":79370,"email":"[email protected]"}{"email":"[email protected]","eventName":"emailOpen","dataFields":{"campaignId":1125010,"ip":"100.100.200.243","userAgentDevice":"Gmail","messageId":"be4e071c11594bb0b4ee3c444fd08b99","emailId":"[email protected]","userAgent":"Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)","workflowName":"DWH TEST 06042020 WITH CALL","locale":null,"templateId":1576122,"emailSubject":"DWH TEST","labels":[],"createdAt":"2020-04-06 15:06:16 +00:00","templateName":"DWH TEST","messageTypeId":27043,"experimentId":79413,"campaignName":"DWH Test Automation","workflowId":79370,"email":"[email protected]","channelId":24365}}

As you can see, there are multiple JSON objects in one single line. I need to associate the extra JSON object {"test":"valid2","workflowId":79370,"email":"[email protected]"} with all/any event JSONs that follow it, as long as the workflowId and email of the extra JSON match the event's workflowId and email.

There can be multiple such extra JSONs, each followed by events, in one single line. I don't know how to read such a file using a combination of Python and PySpark. Using PySpark is mandatory. I tried:

df = sql_context.read.json('test.json')
df.show()

but the output is just the extra JSON:

+--------------+------+----------+
|         email|  test|workflowId|
+--------------+------+----------+
|[email protected]|valid1|     79370|
+--------------+------+----------+

I would want the output to look like:

   id                                email              event      workflow_id  custom  createdatdate  createdattime
0  be4e071c11594bb0b4ee3c444fd08b99  [email protected]  emailOpen  79370        valid2  2020414        154248
1  be4e071c11594bb0b4ee3c444fd08b99  [email protected]  emailOpen  79370        valid2  2020414        154248

Can anyone guide me on how to process such a file and get the resultant df using PySpark?

  • That's not legitimate JSON. If you can assume every JSON document is a single object, you could replace }{ with }\n{ and then use .splitlines(). Commented Sep 27, 2021 at 18:22

1 Answer


Because this is malformed JSON, I would recommend running a pre-processing step that repairs the file. This can be done easily with the jq command-line utility.

The -c flag is for compact output, and will produce newline-delimited JSON instead of pretty-printed output.

jq -c . test.json > test_repaired.json 

You can then read in that file with Spark like so:

>>> spark \
...     .read \
...     .json('test_repaired.json') \
...     .show()
+--------------------+---------------+----------+------+----------+
|          dataFields|          email| eventName|  test|workflowId|
+--------------------+---------------+----------+------+----------+
|                null| [email protected]|      null|valid2|     79370|
|{1125010, DWH Tes...| [email protected]| emailOpen|  null|      null|
|{1100, Cambridge ...|[email protected]| emailOpen|  null|      null|
|{1100, Cambridge ...|[email protected]|emailClick|  null|      null|
|                null| [email protected]|      null|valid2|     79370|
|{1125010, DWH Tes...| [email protected]| emailOpen|  null|      null|
+--------------------+---------------+----------+------+----------+

4 Comments

This is great for reading the JSON into a dataframe. But I don't want a row for that extra JSON; I need to map it onto the events that follow it, based on email and workflowId. So ideally the df should contain only rows where the email and workflow id of the extra JSON match an event:
+--------------------+---------------+----------+------+----------+
|          dataFields|          email| eventName|  test|workflowId|
+--------------------+---------------+----------+------+----------+
|{1125010, DWH Tes...| [email protected]| emailOpen|valid2|     79370|
|{1125010, DWH Tes...| [email protected]| emailOpen|valid2|     79370|
I see. In that case you could programmatically repair the JSON and discard the unneeded data. Or you could read it all into the Dataframe, and then use a df.where clause to remove entries where email is null.
Any suggestions on how to repair the JSONs in Python?
Tim Roberts' recommendation of replacing the "}{" characters with a newline-delimiter is useful for that, or alternatively you can create a JSON array: j: str = "[" + my_json.replace("}{", "},{") + "]", which can then be deserialized with json.loads(j). You can then filter out items that don't have the attributes you need. But if your data is large, I recommend trying to do as much as you can in Spark.
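A more robust variant of the repair suggested in the comments, one that a literal }{ inside a string value cannot fool, is to walk the file with the standard library's json.JSONDecoder.raw_decode, which parses one document at a time and reports where it stopped. A minimal sketch; the sample input is illustrative:

```python
import json


def split_concatenated_json(text: str) -> list:
    """Parse a string containing back-to-back JSON documents into a list."""
    decoder = json.JSONDecoder()
    docs, idx = [], 0
    while idx < len(text):
        # Skip any whitespace between documents.
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        # raw_decode returns the parsed object and the index where it ended.
        obj, idx = decoder.raw_decode(text, idx)
        docs.append(obj)
    return docs


raw = ('{"test":"valid2","workflowId":79370}'
       '{"eventName":"emailOpen","dataFields":{"campaignId":1100}}')
docs = split_concatenated_json(raw)

# Re-serialize as newline-delimited JSON, which Spark can read directly.
ndjson = '\n'.join(json.dumps(d) for d in docs)
```

This gives the same newline-delimited output as the jq -c step in the answer, without shelling out.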
