I have a single line of JSON, as follows:
{"test":"valid2","workflowId":79370,"email":"[email protected]"}{"email":"[email protected]","eventName":"emailOpen","dataFields":{"campaignId":1125010,"ip":"100.100.200.243","userAgentDevice":"Gmail","messageId":"be4e071c11594bb0b4ee3c444fd08b99","emailId":"[email protected]","userAgent":"Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)","workflowName":"DWH TEST 06042020 WITH CALL","locale":null,"templateId":1576122,"emailSubject":"DWH TEST","labels":[],"createdAt":"2020-04-06 15:06:16 +00:00","templateName":"DWH TEST","messageTypeId":27043,"experimentId":79413,"campaignName":"DWH Test Automation","workflowId":79370,"email":"[email protected]","channelId":24365}}{"email":"[email protected]","eventName":"emailOpen","dataFields":{"campaignId":1100,"ip":"50.100.200.243","userAgentDevice":"Gmail","messageId":"zz4e071c11594bb0b4ee3c444fd08b99","emailId":"[email protected]","userAgent":"Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)","workflowName":"TEST","locale":null,"templateId":1576122,"emailSubject":"TEST","labels":"Cambbridge test","createdAt":"2020-04-10 15:06:16 +00:00","templateName":"TEST","messageTypeId":27043,"experimentId":89413,"campaignName":"Cambridge Test","workflowId":18370,"email":"[email protected]","channelId":1111}}{"email":"[email protected]","eventName":"emailClick","dataFields":{"campaignId":1100,"ip":"50.100.200.243","userAgentDevice":"Gmail","messageId":"zzee071c11594bb0b4ee3c444fd08b99","emailId":"[email protected]","userAgent":"Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)","workflowName":"TEST","locale":null,"templateId":1576122,"emailSubject":"TEST","labels":"Cambbridge test","createdAt":"2020-04-10 15:08:16 +00:00","templateName":"TEST","messageTypeId":27043,"experimentId":89413,"campaignName":"Cambridge Test","workflowId":18370,"email":"[email protected]","channelId":1111}}{"test":"valid2","workflowId":79370,"email":"[email protected]"}{"email":"[email protected]","eventName":"emailOpen","dataFields":{"campaignId":1125010,"ip":"100.100.200.243","userAgentDevice":"Gmail","messageId":"be4e071c11594bb0b4ee3c444fd08b99","emailId":"[email protected]","userAgent":"Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)","workflowName":"DWH TEST 06042020 WITH CALL","locale":null,"templateId":1576122,"emailSubject":"DWH TEST","labels":[],"createdAt":"2020-04-06 15:06:16 +00:00","templateName":"DWH TEST","messageTypeId":27043,"experimentId":79413,"campaignName":"DWH Test Automation","workflowId":79370,"email":"[email protected]","channelId":24365}}

As you can see, there are multiple JSON objects on one single line. I need to associate the extra JSON object {"test":"valid2","workflowId":79370,"email":"[email protected]"} with every event object that follows it, as long as the workflowId and email of the extra object match the event's workflowId and email.
There can be multiple such extra JSON objects, each followed by events, on one single line. I don't know how to read such a file using a combination of Python and PySpark. Using PySpark is mandatory. I tried:

    df = sql_context.read.json('test.json')
    df.show()

but the output is just the extra JSON object:

    +--------------+------+----------+
    |         email|  test|workflowId|
    +--------------+------+----------+
    |[email protected]|valid1|     79370|
    +--------------+------+----------+

I would want the output to look like:

       id                                email              event      workflow_id  custom  createdatdate  createdattime
    0  be4e071c11594bb0b4ee3c444fd08b99  [email protected]  emailOpen  79370        valid2  2020414        154248
    1  be4e071c11594bb0b4ee3c444fd08b99  [email protected]  emailOpen  79370        valid2  2020414        154248

Can anyone guide me on how to process such a file and get the resulting DataFrame using PySpark?

Replace }{ with }\n{ and then use .splitlines().
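
A minimal sketch of that idea, in case it helps: insert a newline between adjacent objects (turning `}{` into `}\n{`), parse each piece with `json.loads`, then walk the objects in order and attach the most recent extra object to each event whose workflowId and email match. The function names (`split_objects`, `associate`, `to_spark_df`), the column schema, and the way `createdAt` is cut into date and time strings are all assumptions made for illustration, not something stated in the question. The split also assumes that `}{` never occurs inside a string value.

```python
import json

# Hypothetical column layout, modelled on the desired output in the question.
SCHEMA = ("id string, email string, event string, workflow_id long, "
          "custom string, createdatdate string, createdattime string")


def split_objects(line):
    """Split one line of back-to-back JSON objects into a list of dicts.

    Assumes the two characters "}{" never appear inside a string value.
    """
    pieces = line.replace("}{", "}\n{").splitlines()
    return [json.loads(piece) for piece in pieces if piece.strip()]


def associate(objects):
    """Attach the latest extra object to each following event that matches it."""
    rows = []
    extra = None
    for obj in objects:
        if "eventName" not in obj:
            # Assumption: anything without "eventName" is an extra object.
            extra = obj
            continue
        fields = obj.get("dataFields") or {}
        matches = (
            extra is not None
            and extra.get("workflowId") == fields.get("workflowId")
            and extra.get("email") == obj.get("email")
        )
        # Assumption: "2020-04-06 15:06:16 +00:00" becomes "20200406" / "150616".
        created = (fields.get("createdAt") or "").split(" ")
        date_part = created[0].replace("-", "")
        time_part = created[1].replace(":", "") if len(created) > 1 else ""
        rows.append((
            fields.get("messageId"),
            obj.get("email"),
            obj.get("eventName"),
            fields.get("workflowId"),
            extra.get("test") if matches else None,
            date_part,
            time_part,
        ))
    return rows


def to_spark_df(rows):
    """Build a Spark DataFrame from the tuples (requires pyspark to be installed)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame(rows, schema=SCHEMA)
```

With the file from the question, something like `to_spark_df(associate(split_objects(open('test.json').read())))` should give one row per event. Events that do not match the preceding extra object are kept with a null `custom` value here; if only matched events are wanted, they can be filtered out afterwards, for example with `df.filter(df.custom.isNotNull())`.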