I am using Databricks Autoloader with PySpark to stream Parquet files into a Delta table. Here's a simplified version of what I am doing:
```python
spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "parquet") \
    .load("path") \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .toTable("my_table")
```

What I want to ensure is that every ingested file has exactly the same column names, in the same order, as the target Delta table (my_table). This is to avoid scenarios where column values end up in the wrong columns because of a schema mismatch.
I know that .schema(...) can be set on readStream, but that enforces a single static schema, whereas I want to validate the schema of each incoming file dynamically and reject any file that does not match.
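For reference, this is the static approach I mean, as a minimal sketch; the DDL string and column names are placeholders, not my real schema:

```python
# Hypothetical expected schema as a DDL string (placeholder column names)
EXPECTED_DDL = "id BIGINT, name STRING"

def ddl_column_names(ddl):
    """Pull the column names out of a simple, non-nested DDL string."""
    return [field.strip().split()[0] for field in ddl.split(",")]

def read_with_static_schema(spark, path):
    # The same schema is applied to every file. For Parquet, columns are
    # matched by name, so a mismatched file is not rejected -- missing
    # columns just come back as null.
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .schema(EXPECTED_DDL)
            .load(path))
```

So the static schema silently tolerates the very mismatches I want to catch.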
I was hoping to use .foreachBatch(...) to perform per-batch validation before writing to the table, but .foreachBatch() is a method of writeStream, not readStream. Is there a way to validate the schema (column names and order) of incoming files before writing with Autoloader?
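Roughly what I am trying to express, as a sketch (assuming the ambient Databricks `spark` session and the table name from above):

```python
def columns_match(incoming_cols, target_cols):
    # Strict check: names AND order must be identical
    return list(incoming_cols) == list(target_cols)

def validate_and_append(batch_df, batch_id):
    # `spark` is the ambient session on Databricks; "my_table" as in the question
    target_cols = spark.table("my_table").columns
    if not columns_match(batch_df.columns, target_cols):
        raise ValueError(
            f"Batch {batch_id} schema mismatch: "
            f"{batch_df.columns} != {target_cols}"
        )
    batch_df.write.format("delta").mode("append").saveAsTable("my_table")

# Wired in on the write side:
# (spark.readStream ... .load("path")
#  .writeStream.foreachBatch(validate_and_append).start())
```

The catch is granularity: raising here fails the whole micro-batch (and stops the stream) rather than rejecting just the offending file, which is exactly the per-file behaviour I am after.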
If I could ask Autoloader which files are next to be loaded, maybe I could check their Parquet footers (where the schema is stored) myself, without moving Autoloader's checkpoint forward?