
I just ran a large batch transform job on AWS SageMaker and ended up with a 10 GB file in the following format:

{"vectors": [[1024 items], ..., [1024 items]]}{"vectors": [[1024 items], ..., [1024 items]]}{"vectors": [[1024 items], ..., [1024 items]]}...{"vectors": [[1024 items], ..., [1024 items]]} 

In total there are about 44,000 JSON entries, each with 10 lists of 1,024 items. How can I extract the JSON objects from this file, one by one, so they're available for post-processing?

Ideal pseudocode:

for json_obj in file:
    do_stuff(json_obj)

I've tried the snippet below, but my kernel dies each time.

with open(path, "r") as f:
    for line in f:
        do_stuff(line)
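
I suspect this dies because the file contains no newlines, so for line in f ends up pulling the entire 10 GB into memory as a single "line". One direction I'm considering is to read the file in fixed-size chunks and peel objects off the front of a buffer with json.JSONDecoder.raw_decode, which parses one JSON document and reports where it ends. A rough sketch of what I mean (iter_json_objects and chunk_size are placeholder names, and I haven't validated this against the full file):

import json

def iter_json_objects(path, chunk_size=1 << 20):
    """Yield JSON objects one at a time from a file of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    buffer = ""
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buffer += chunk
            while buffer:
                buffer = buffer.lstrip()
                try:
                    # raw_decode returns the parsed object and the index where it stopped
                    obj, end = decoder.raw_decode(buffer)
                except json.JSONDecodeError:
                    # Incomplete object at the end of the buffer; read the next chunk
                    break
                yield obj
                buffer = buffer[end:]

for record in iter_json_objects(path):
    do_stuff(record)  # record["vectors"] holds the 10 lists of 1,024 items

The idea is that memory stays bounded by roughly one chunk plus one parsed object rather than the whole file, but I'm not sure whether this is the idiomatic way to handle concatenated JSON.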
  • Why don't you use spark based computation? Commented Jun 16, 2023 at 5:19
  • @mbaxi, no good reason! Which is to say that I'm definitely open to using Spark. It looks like Spark has support for JSON Lines, but my format is unfortunately not quite JSON Lines, as there's no line separator between JSON objects. Do you have any suggestions for reading this in JSON by JSON with Spark? Commented Jun 16, 2023 at 5:32
  • this should help - stackoverflow.com/questions/69351498/… Commented Jun 16, 2023 at 11:26
