Databricks Autoloader schema hints are not taken into consideration in the schema file

I am using Autoloader with schema inference to automatically load data from S3 into a Delta table (also on S3).

I have one column that is a map, which overwhelms Autoloader's inference (it tries to infer the column as a struct, creating one struct field per key it encounters), so I use a schema hint for that column.
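
For context, the hint passed as SCHEMA_HINT below is just a DDL-formatted string covering that single column, along these lines (the column name and value type here are simplified placeholders, not my real schema):

# Simplified placeholder for the actual hint; the real column name and value
# type differ, but it is one map column expressed as a SQL DDL string
SCHEMA_HINT = "event_properties MAP<STRING, STRING>"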

My output DataFrame / Delta table looks exactly as expected, so the schema hint works great in that regard.

The only problem I am facing is that during schema inference the schema hint does not seem to be taken into account. The inference stage of the Spark job is very slow, and the schema file that Autoloader produces is still enormous, causing driver OOMs. Has anyone faced something similar before?

The code is very simple:

spark \
    .readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.inferColumnTypes", "true") \
    .option("cloudFiles.schemaHints", SCHEMA_HINT) \
    .option("cloudFiles.schemaLocation", f"{target_s3_bucket}/_schema/{source_table_name}") \
    .load(f"{source_s3_bucket}/{source_table_name}") \
    .writeStream \
    .trigger(availableNow=True) \
    .format("delta") \
    .option("mergeSchema", "true") \
    .option("checkpointLocation", f"{target_s3_bucket}/_checkpoint/{source_table_name}") \
    .option("streamName", source_table_name) \
    .start(f"{target_s3_bucket}/{target_table_name}")
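
For reference, this is roughly how I check how large the inferred schema files have grown (a minimal sketch, assuming the _schemas subdirectory that Autoloader creates under the configured schema location, and dbutils being available on a Databricks cluster):

# List every schema version Autoloader has written so far, with its size in bytes.
# Assumes the standard "_schemas" subdirectory under cloudFiles.schemaLocation.
schema_dir = f"{target_s3_bucket}/_schema/{source_table_name}/_schemas"
for file_info in dbutils.fs.ls(schema_dir):
    print(file_info.name, file_info.size)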