
I just ran a large batch transform job on AWS SageMaker and ended up with a 10 GB file in the following format:

{"vectors": [[1024 items], ..., [1024 items]]}{"vectors": [[1024 items], ..., [1024 items]]}{"vectors": [[1024 items], ..., [1024 items]]}...{"vectors": [[1024 items], ..., [1024 items]]} 

In total there are about 44,000 JSON entries, each with 10 lists of 1,024 items. How can I extract the JSON objects from this file, one by one, so they're available for post-processing?

Ideal pseudocode:

for json_obj in file:
    do_stuff(json_obj)

I've tried the snippet below, but my kernel dies each time.

with open(path, "r") as f:
    for line in f:
        do_stuff(line)
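
I suspect this dies because the file contains no newlines, so for line in f ends up pulling the entire 10 GB into memory as a single "line". One direction I'm considering is to read the file in fixed-size chunks and peel objects off the front of a buffer with json.JSONDecoder.raw_decode, which parses one JSON document and reports where it ends. A rough sketch of what I mean (iter_json_objects and chunk_size are placeholder names, and I haven't validated this against the full file):

import json

def iter_json_objects(path, chunk_size=1 << 20):
    """Yield JSON objects one at a time from a file of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    buffer = ""
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buffer += chunk
            while buffer:
                buffer = buffer.lstrip()
                try:
                    # raw_decode returns the parsed object and the index where it stopped
                    obj, end = decoder.raw_decode(buffer)
                except json.JSONDecodeError:
                    # Incomplete object at the end of the buffer; read the next chunk
                    break
                yield obj
                buffer = buffer[end:]

for record in iter_json_objects(path):
    do_stuff(record)  # record["vectors"] holds the 10 lists of 1,024 items

The idea is that memory stays bounded by roughly one chunk plus one parsed object rather than the whole file, but I'm not sure whether this is the idiomatic way to handle concatenated JSON.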
  • Why don't you use spark based computation? Commented Jun 16, 2023 at 5:19
  • @mbaxi, no good reason! Which is to say that I'm definitely open to using Spark. It looks like Spark has support for JSON Lines, but my format is unfortunately not quite JSON Lines, as there's no line separator between JSON objects. Do you have any suggestions for reading this in JSON by JSON with Spark? Commented Jun 16, 2023 at 5:32
  • this should help - stackoverflow.com/questions/69351498/… Commented Jun 16, 2023 at 11:26
