I am new to Spark and am trying to read a JSON file in the format below into a Spark DataFrame. This is the format of my JSON:

    "elements": [
        Q4 { Name:ABC, Language:English, Age:45, Title:SWE },
        Q5 { Name:DEF, Language:English, Age:60 Title: Engineer },
        Q6 { Name:HIJ, Language:English, Age:57, Title: }
    ]

I want the output to be:

    Name | Language | Age | Title
    ABC  | English  | 45  | SWE
    DEF  | English  | 60  | Engineer
    HIJ  | English  | 57  | Null

How do I achieve this with PySpark?

1 Answer

Please try using

    df = spark.read.json(path)  # path: location of your JSON file

to read the file. It will load your data into a DataFrame. You may need to select the nested JSON element if the records you need sit inside that element.

--Edited part: If you want to use a hard-coded string, please refer to the Spark documentation. Example content from the Spark documentation:

    sc = spark.sparkContext
    jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
    otherPeopleRDD = sc.parallelize(jsonStrings)
    otherPeople = spark.read.json(otherPeopleRDD)
    otherPeople.show()
    # +---------------+----+
    # |        address|name|
    # +---------------+----+
    # |[Columbus,Ohio]| Yin|
    # +---------------+----+
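If the nested address struct should become top-level columns instead, one possible approach is sketched below. The helper names `flatten_person` and `flatten_with_spark` are made up for illustration and are not part of the Spark documentation. The first helper shows the intended shape on plain Python dicts; the second assumes an active SparkSession and relies on `select` with dotted paths to pull fields out of the struct.

```python
import json


def flatten_person(record):
    """Lift the nested address fields to the top level of a plain dict."""
    address = record.get("address") or {}
    return {
        "name": record.get("name"),
        "city": address.get("city"),
        "state": address.get("state"),
    }


json_strings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
flat_records = [flatten_person(json.loads(s)) for s in json_strings]
print(flat_records)


def flatten_with_spark(spark, json_strings):
    """Same idea on a DataFrame (hypothetical helper).

    Assumes pyspark is installed and `spark` is an active SparkSession.
    Selecting "address.city" yields a column named "city".
    """
    df = spark.read.json(spark.sparkContext.parallelize(json_strings))
    return df.select("name", "address.city", "address.state")
```

Calling `flatten_with_spark(spark, json_strings).show()` should then display name, city and state as separate columns, without the address struct.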

--Edit2: This uses your example, but I picked only the required data to create the DataFrame. I hope this works for you.

    import json

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.master("local").getOrCreate()

    json_doc1 = '{"elements": {"Q4":{"Name":"ABC","Language":"English","Age":45,"Title":"SWE"},"Q5": {"Name":"DEF","Language":"English","Age":60,"Title": "Engineer"}}}'

    # Parse the document and keep only the records stored under "elements".
    test = json.loads(json_doc1)
    data1 = test['elements'].values()
    print(data1)

    # Build one Row per record and create the DataFrame from the list of Rows.
    df1 = spark.createDataFrame([Row(**x) for x in data1])
    df1.show()
    # +---+--------+----+--------+
    # |Age|Language|Name|   Title|
    # +---+--------+----+--------+
    # | 60| English| DEF|Engineer|
    # | 45| English| ABC|     SWE|
    # +---+--------+----+--------+
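The example above leaves out the Q6 record, whose Title is empty. As a rough sketch of how that case could be handled, the code below normalises each record in plain Python so that a missing or empty Title becomes None, which Spark shows as null. It assumes the corrected JSON represents the empty title as an empty string; the names `elements_to_rows`, `rows_to_dataframe`, `COLUMNS` and `SCHEMA` are hypothetical and chosen only for this illustration.

```python
import json

COLUMNS = ["Name", "Language", "Age", "Title"]
# Explicit schema so the column order and types do not depend on inference.
SCHEMA = "Name string, Language string, Age long, Title string"

# Assumption: a valid version of the question's JSON, with an empty Title for Q6.
json_doc = (
    '{"elements": {'
    '"Q4": {"Name": "ABC", "Language": "English", "Age": 45, "Title": "SWE"},'
    '"Q5": {"Name": "DEF", "Language": "English", "Age": 60, "Title": "Engineer"},'
    '"Q6": {"Name": "HIJ", "Language": "English", "Age": 57, "Title": ""}}}'
)


def elements_to_rows(json_text, columns=COLUMNS):
    """Return one tuple per record; missing or empty values become None."""
    rows = []
    for record in json.loads(json_text)["elements"].values():
        values = []
        for col in columns:
            value = record.get(col)
            values.append(None if value == "" else value)
        rows.append(tuple(values))
    return rows


rows = elements_to_rows(json_doc)
print(rows)


def rows_to_dataframe(spark, rows, schema=SCHEMA):
    """Hypothetical helper; assumes pyspark and an active SparkSession."""
    return spark.createDataFrame(rows, schema=schema)
```

With a SparkSession available, `rows_to_dataframe(spark, rows).show()` should list the three records in the requested column order, with null in the Title column for HIJ.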

Thanks, Manu

3 Comments

Thanks. From your example, if I wanted to make Name, City and State columns on my DataFrame and remove the address struct, how would I go about doing that?
Hey, the first thing you need to do is correct the JSON document. The JSON you have given here is not valid. You can easily validate JSON using online tools.
It's always good to post an update once your work is done. If it isn't done, you should still post an update.
