Handle JSON structure with Pyspark

Question

I am new to Spark and am trying to read a JSON file in the format below into a Spark DataFrame. This is the format of my JSON:

"elements": [

Q4 { Name:ABC, Language:English, Age:45, Title:SWE },
Q5 { Name:DEF, Language:English, Age:60 Title: Engineer },
Q6 { Name:HIJ, Language:English, Age:57, Title: }

]

I want the output to be

Name | Language | Age | Title
ABC  | English  | 45  | SWE
DEF  | English  | 60  | Engineer
HIJ  | English  | 57  | Null

How do I achieve this with pyspark?

Manu Gupta · Accepted Answer · 2019-09-12 06:42:05Z

Please try using

df = spark.read.json(path)

to read the file, where path is the location of your JSON file. It will convert your data into a DataFrame. You may need to select the nested JSON element if you need the document inside it.

--Edited part: If you want to use a hard-coded string, please refer to the Spark documentation. Example from the Spark documentation:

sc = spark.sparkContext
jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
# +---------------+----+
# |        address|name|
# +---------------+----+
# |[Columbus,Ohio]| Yin|
# +---------------+----+

--Edit2: This uses your example, but I picked only the required data to create the DataFrame. I hope this works for you.

import json

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

json_doc1 = '{"elements": {"Q4":{"Name":"ABC","Language":"English","Age":45,"Title":"SWE"},"Q5": {"Name":"DEF","Language":"English","Age":60,"Title": "Engineer"}}}'

# Parse the document and keep only the records stored under "elements".
test = json.loads(json_doc1)
data1 = test['elements'].values()
print(data1)

# Build one Row per record and create the DataFrame from those rows.
df1 = spark.createDataFrame(Row(**x) for x in data1)
df1.show()
# +---+--------+----+--------+
# |Age|Language|Name|   Title|
# +---+--------+----+--------+
# | 60| English| DEF|Engineer|
# | 45| English| ABC|     SWE|
# +---+--------+----+--------+
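The question also asks for a record whose Title is empty to show up as Null, which the two-record example does not cover. Here is a minimal sketch of one way to handle that. It assumes the corrected JSON keys each record by "Q4", "Q5", "Q6" and stores the missing title as an empty string (both are assumptions, since the input in the question is not valid JSON). The names COLUMNS, normalize_records, and to_dataframe are hypothetical helpers, not part of PySpark.

```python
import json

# Column order requested in the question.
COLUMNS = ["Name", "Language", "Age", "Title"]

# Assumed corrected form of the question's input (the original is not valid JSON).
json_doc = """
{"elements": {
  "Q4": {"Name": "ABC", "Language": "English", "Age": 45, "Title": "SWE"},
  "Q5": {"Name": "DEF", "Language": "English", "Age": 60, "Title": "Engineer"},
  "Q6": {"Name": "HIJ", "Language": "English", "Age": 57, "Title": ""}
}}
"""


def normalize_records(doc):
    """Return one tuple per element, in COLUMNS order, with missing or empty values as None."""
    elements = json.loads(doc)["elements"]
    rows = []
    for record in elements.values():
        row = tuple(
            None if record.get(col) in ("", None) else record[col]
            for col in COLUMNS
        )
        rows.append(row)
    return rows


def to_dataframe(spark, rows):
    """Build a DataFrame from the normalized tuples using an explicit schema."""
    # Imported here so the pure-Python part above runs without PySpark installed.
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    schema = StructType([
        StructField("Name", StringType(), True),
        StructField("Language", StringType(), True),
        StructField("Age", LongType(), True),
        StructField("Title", StringType(), True),
    ])
    return spark.createDataFrame(rows, schema)


rows = normalize_records(json_doc)
print(rows)
```

With an existing SparkSession named spark, to_dataframe(spark, rows).show() should then display the three records with a null Title for HIJ. The explicit schema fixes the column order and types, so they do not depend on how Spark would infer them from the data.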

Thanks, Manu

Thanks. From your example, if I wanted to make Name, City and State columns on my DataFrame and remove the address struct, how would I go about doing that?
Hey, the first thing you need to do is correct the JSON document. The JSON you have given here is not valid. You can easily validate JSON using online tools.
It's always good to post an update once your work is done. If it isn't, you should post an update as well.
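On the follow-up above about turning name, city, and state into top-level columns: nested struct fields can be reached with dot notation in a select. The following is a minimal sketch, assuming a DataFrame shaped like otherPeople from the Spark documentation example (a name column plus an address struct with city and state). The helper name flatten_address and the column aliases are hypothetical, chosen only for illustration.

```python
def flatten_address(df):
    """Promote the nested address fields to top-level columns and drop the struct."""
    # selectExpr accepts SQL expressions; "address.city" reaches into the struct.
    # Only the listed expressions are kept, so the address struct itself is dropped.
    return df.selectExpr(
        "name AS Name",
        "address.city AS City",
        "address.state AS State",
    )
```

For example, flatten_address(otherPeople).show() should print a table with the three columns Name, City, and State.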
