
I am new to Apache Spark, so forgive me if this is a noob question. I am trying to define a particular schema before reading in the dataset in order to speed up processing. There are a few data types that I am not sure how to define (ArrayType and StructType).

Here is a screenshot of the schema I am working with:

(screenshot of the nested schema, not reproduced here)

Here is what I have so far:

jsonSchema = StructType([
    StructField("attribution", ArrayType(), True),
    StructField("averagingPeriod", StructType(), True),
    StructField("city", StringType(), True),
    StructField("coordinates", StructType(), True),
    StructField("country", StringType(), True),
    StructField("date", StructType(), True),
    StructField("location", StringType(), True),
    StructField("mobile", BooleanType(), True),
    StructField("parameter", StringType(), True),
    StructField("sourceName", StringType(), True),
    StructField("sourceType", StringType(), True),
    StructField("unit", StringType(), True),
    StructField("value", DoubleType(), True)
])

My question is: How do I account for the name and url under the attribution column, the unit and value under the averagingPeriod column, etc?

For reference, here is the dataset I am using: https://registry.opendata.aws/openaq/.

1 Answer


Here's an example of an array type and a struct type: you pass the nested fields as a StructType of StructFields, and for an array you wrap that in ArrayType. It should be straightforward to do the same for the other nested columns.

from pyspark.sql.types import *

jsonSchema = StructType([
    StructField("attribution", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("url", StringType())
    ])), True),
    StructField("averagingPeriod", StructType([
        StructField("unit", StringType()),
        StructField("value", DoubleType())
    ]), True),
    # ... etc.
])