
I have an RDD of 50,000 JSON files that I need to write to a mounted directory in Spark (Databricks). The mounted path looks something like /mnt/myblob/mydata (using Azure). I tried the following, but it turns out that I can't use dbutils inside a Spark job.

```python
def write_json(output_path, json_data):
    dbutils.fs.put(output_path, json_data)
```

What I currently do instead is collect the data to the driver and then call the write_json method for each record.

```python
records = my_rdd.collect()
for r in records:
    write_json(r['path'], r['json'])
```

This approach works, but takes forever to finish. Is there a faster way?

  • What does your RDD look like? Does it have one fully formed JSON per record? Commented Apr 9, 2019 at 13:49
  • Yes, one well-formed JSON per record. Commented Apr 9, 2019 at 14:50

1 Answer


You can use foreach to perform this operation in parallel on the executors. Note that map alone would not work here: map is a lazy transformation, so the writes would never actually execute until an action is triggered, whereas foreach is an action and runs immediately.

```python
def write_json(output_path, json_data):
    with open(output_path, "w") as f:
        f.write(json_data)

my_rdd.foreach(lambda r: write_json(r['path'], r['json']))
```

Since the writes happen on the workers, the paths must be reachable from the executors; on Databricks, a mount like /mnt/myblob/mydata is typically accessible from worker code through the DBFS FUSE layer (e.g. under /dbfs).
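If per-record overhead is a concern at this scale (50,000 files), a per-partition variant can amortize setup cost by handling a whole partition of records in one function call. The sketch below demonstrates the partition-writer function with plain Python data so it is self-contained; in the question's setup you would instead call my_rdd.foreachPartition(write_partition), and the paths would point under the mount (the record keys 'path' and 'json' are taken from the question).

```python
import json
import os
import tempfile

def write_partition(records):
    # Receives an iterator over all records of one partition; with Spark this
    # would be invoked once per partition via my_rdd.foreachPartition(write_partition).
    for r in records:
        with open(r['path'], "w") as f:
            f.write(r['json'])

# Local demonstration without Spark: build a few records and write them.
tmp = tempfile.mkdtemp()
records = [
    {'path': os.path.join(tmp, f"doc_{i}.json"),
     'json': json.dumps({"id": i})}
    for i in range(3)
]
write_partition(iter(records))
print(sorted(os.listdir(tmp)))  # → ['doc_0.json', 'doc_1.json', 'doc_2.json']
```

Whether foreach or foreachPartition is faster depends on the storage layer; for simple file writes the difference is usually small, but foreachPartition is the natural place to hoist any per-task setup (clients, buffers) out of the inner loop.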
