I have an RDD of 50,000 JSON files that I need to write to a mounted directory in Spark (Databricks). The mounted path looks something like /mnt/myblob/mydata (using Azure). I tried the following, but it turns out that I can't use dbutils inside a Spark job.
```python
def write_json(output_path, json_data):
    dbutils.fs.put(output_path, json_data)
```

What I currently have to do instead is collect the data to the driver and then call `write_json` there:
```python
records = my_rdd.collect()
for r in records:
    write_json(r['path'], r['json'])
```

This approach works, but it takes forever to finish. Is there a faster way?