
I am new to Apache Spark, so forgive me if this is a noob question. I am trying to define a particular schema before reading in the dataset in order to speed up processing. There are a few data types that I am not sure how to define (ArrayType and StructType).

Here is a screenshot of the schema I am working with:

(screenshot of the nested schema, not reproduced here)

Here is what I have so far:

jsonSchema = StructType([
    StructField("attribution", ArrayType(), True),
    StructField("averagingPeriod", StructType(), True),
    StructField("city", StringType(), True),
    StructField("coordinates", StructType(), True),
    StructField("country", StringType(), True),
    StructField("date", StructType(), True),
    StructField("location", StringType(), True),
    StructField("mobile", BooleanType(), True),
    StructField("parameter", StringType(), True),
    StructField("sourceName", StringType(), True),
    StructField("sourceType", StringType(), True),
    StructField("unit", StringType(), True),
    StructField("value", DoubleType(), True)
])

My question is: How do I account for the name and url under the attribution column, the unit and value under the averagingPeriod column, etc?

For reference, here is the dataset I am using: https://registry.opendata.aws/openaq/.

1 Answer


Here's an example of an array type and a struct type: you pass the nested fields as a StructType of StructFields, and for an array you wrap that in ArrayType. It should be straightforward to do the same for the other nested columns.

from pyspark.sql.types import *

jsonSchema = StructType([
    StructField("attribution", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("url", StringType())
    ])), True),
    StructField("averagingPeriod", StructType([
        StructField("unit", StringType()),
        StructField("value", DoubleType())
    ]), True),
    # ... etc.
])