I have created a function that reads a JSON string and infers its schema in a single expression, and I am calling that function from Spark Structured Streaming. It fails with the error shown below. The same logic works when I create the schema first and then use that schema to read the JSON, but not when it is done in a single line. How can I fix it?
```scala
def processBatch(microBatchOutputDF: DataFrame, batchId: Long) {
  TOPICS.split(',').foreach(topic => {
    var TableName = topic.split('.').last.toUpperCase
    var df = microBatchOutputDF
    /*
    var schema = schema_of_json(df
      .select($"value")
      .filter($"topic".contains(topic))
      .as[String]
    )
    */
    var jsonDataDf = df.filter($"topic".contains(topic))
      .withColumn("jsonData",
        from_json($"value",
          schema_of_json(lit($"value".as[String])),
          scala.collection.immutable.Map[String, String]().asJava))
    var srcTable = jsonDataDf
      .select(col(s"jsonData.payload.after.*"), $"offset", $"timestamp")
    srcTable
      .select(srcTable.columns.map(c => col(c).cast(StringType)) : _*)
      .write
      .mode("append").format("delta").save("/mnt/datalake/raw/kafka/" + TableName)
    spark.sql(s"""CREATE TABLE IF NOT EXISTS kafka_raw.$TableName USING delta LOCATION '/mnt/datalake/raw/kafka/$TableName'""")
  })
}
```

Spark streaming code:
```scala
import org.apache.spark.sql.streaming.Trigger

val StreamingQuery = InputDf
  .select("*")
  .writeStream.outputMode("update")
  .option("queryName", "StreamingQuery")
  .foreachBatch(processBatch _)
  .start()
```

Error:

```
org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema_of_json(value)
```
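For reference, the two-step variant that does work for me looks roughly like this (a minimal sketch inside `processBatch`; `sampleJson` is an illustrative name, and it assumes `spark.implicits._` is imported and each micro-batch has at least one row per topic):

```scala
import org.apache.spark.sql.functions._

// Collect one sample message for the topic on the driver and infer the
// schema from it once, instead of per row (sampleJson is illustrative).
val sampleJson: String = df
  .filter($"topic".contains(topic))
  .select($"value")
  .as[String]
  .head()

// schema_of_json over a string literal is a foldable (constant) expression,
// which is what the analyzer requires for the schema argument of from_json.
val jsonDataDf = df
  .filter($"topic".contains(topic))
  .withColumn("jsonData", from_json($"value", schema_of_json(lit(sampleJson))))
```

My understanding of the error message is that `from_json` needs a schema that is constant at analysis time, so `schema_of_json($"value")` cannot be evaluated row by row against the `value` column; that would explain why the inline single-line version fails while the schema-first version works.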