
I have an RDD of 50,000 JSON files that I need to write to a mounted directory in Spark (Databricks). The mounted path looks something like /mnt/myblob/mydata (using Azure). I tried the following, but it turns out that I can't use dbutils inside a Spark job.

```python
def write_json(output_path, json_data):
    dbutils.fs.put(output_path, json_data)
```

What I currently do instead is collect the data to the driver and then call the write_json method for each record.

```python
records = my_rdd.collect()
for r in records:
    write_json(r['path'], r['json'])
```

This approach works, but takes forever to finish. Is there a faster way?

  • What does your RDD look like? Does it have one fully formed JSON per record? Commented Apr 9, 2019 at 13:49
  • Yes, one well-formed JSON per record. Commented Apr 9, 2019 at 14:50

1 Answer


You can use foreach to perform this operation in parallel on the executors. Note that map alone would not work here: map is a lazy transformation, so the writes would never actually execute until an action is triggered, whereas foreach is an action and runs immediately.

```python
def write_json(output_path, json_data):
    with open(output_path, "w") as f:
        f.write(json_data)

my_rdd.foreach(lambda r: write_json(r['path'], r['json']))
```

Since the writes happen on the workers, the paths must be reachable from the executors; on Databricks, a mount like /mnt/myblob/mydata is typically accessible from worker code through the DBFS FUSE layer (e.g. under /dbfs).
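If per-record overhead is a concern at this scale (50,000 files), a per-partition variant can amortize setup cost by handling a whole partition of records in one function call. The sketch below demonstrates the partition-writer function with plain Python data so it is self-contained; in the question's setup you would instead call my_rdd.foreachPartition(write_partition), and the paths would point under the mount (the record keys 'path' and 'json' are taken from the question).

```python
import json
import os
import tempfile

def write_partition(records):
    # Receives an iterator over all records of one partition; with Spark this
    # would be invoked once per partition via my_rdd.foreachPartition(write_partition).
    for r in records:
        with open(r['path'], "w") as f:
            f.write(r['json'])

# Local demonstration without Spark: build a few records and write them.
tmp = tempfile.mkdtemp()
records = [
    {'path': os.path.join(tmp, f"doc_{i}.json"),
     'json': json.dumps({"id": i})}
    for i in range(3)
]
write_partition(iter(records))
print(sorted(os.listdir(tmp)))  # → ['doc_0.json', 'doc_1.json', 'doc_2.json']
```

Whether foreach or foreachPartition is faster depends on the storage layer; for simple file writes the difference is usually small, but foreachPartition is the natural place to hoist any per-task setup (clients, buffers) out of the inner loop.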
