2

I have a big JSON file (exported via Azure Data Factory). When Data Factory runs into an issue, it adds $ signs between objects. For example, the file looks like this:

    {...}
    {...}
    {...}${...}

As a result I get an error such as json.decoder.JSONDecodeError: Extra data: line 1 column 21994 (char 21993)

I used to handle this the easy way, replacing $ with \n in Notepad++, and that worked fine. But now my file is about 1.3 GB and I don't have a tool that can edit such a big file.

I use Python to read all the JSON objects in the file and export them to XML files.

Now I'm looking for a way to replace all of the $ signs with newlines (\n) and clean up the file.

The beginning of my code is:

    a = open('test.json', 'r', encoding='UTF8')
    data1 = a.readlines()
    a.close()
    for i in range(len(data1)):
        print('Done%d/%d' % (i, len(data1)))
        jsI = json.loads(data1[i])

and as soon as a line contains a $ sign, json.loads fails and the script stops.

May I ask for some advice on how to replace $ signs with newlines in a file using Python?

1
  • Is this on a Windows, Mac, or Linux system? Commented Jan 31, 2019 at 17:35

2 Answers

2

The problem is probably a.readlines(), because it loads the entire file into memory. When dealing with huge files, it is much better to read them line by line, like this:

    with open(fname) as f:
        for line in f:
            # Do your magic here, on this loop
            ...
    # No need to close it, since the `with` will take care of that.

If your objective is to replace every $ with a \n, it will be like this:

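    # str.replace returns a new string, so the result is written to a second file.
    # out_fname is a placeholder for the path of the cleaned copy.
    with open(fname) as src, open(out_fname, "w") as dst:
        for line in src:
            dst.write(line.replace("$", "\n"))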
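If the export contains few line breaks, a single "line" can itself be very large, so reading fixed-size chunks may be safer than iterating over lines. The following is a minimal sketch of that idea, not a definitive implementation: the function name `replace_dollars`, its parameters, and the UTF-8 encoding are assumptions made for illustration. Because the replacement is one character for one character, a chunk boundary can never split a match.

```python
def replace_dollars(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path, turning every '$' into a newline.

    Only chunk_size characters are held in memory at a time.
    """
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty string means end of file
                break
            dst.write(chunk.replace("$", "\n"))
```

Note that this also replaces any $ that appears inside a JSON string value, which would corrupt that object; it is only appropriate when $ never occurs in the data itself.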

0

To handle possible $ characters in strings within JSON objects, you can split the input string data1 on $ into fragments, then join the fragments one by one until the result is parsable as JSON, at which point you output it and clear it to move on to the next fragment:

    import json

    candidate = ''
    for fragment in data1.split('$'):
        candidate += fragment
        try:
            json.loads(candidate)
            print(candidate)
            candidate = ''
        except json.decoder.JSONDecodeError:
            # Not a complete object yet: the $ was part of the data, so restore it.
            candidate += '$'
            continue

Given data1 = '''{}${"a":"$"}${"b":{"c":2}}''', for example, this outputs:

    {}
    {"a":"$"}
    {"b":{"c":2}}
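For a 1.3 GB file, the same fragment-joining idea can be applied one line at a time so that the whole file never has to be in memory. This is a rough sketch under assumptions: the helper names `split_objects` and `iter_objects` are made up for illustration, the file is assumed to be UTF-8, and each JSON object is assumed to sit entirely on one line.

```python
import json


def split_objects(line, sep="$"):
    """Yield each JSON value found in `line`, where values are joined by `sep`."""
    candidate = ""
    for fragment in line.split(sep):
        candidate += fragment
        if not candidate.strip():
            # Nothing but whitespace so far (e.g. a leading or doubled separator).
            candidate = ""
            continue
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            # Incomplete object: the separator belonged to the data, restore it.
            candidate += sep
        else:
            yield obj
            candidate = ""
    if candidate:
        raise ValueError("trailing data is not valid JSON: %r" % candidate[:50])


def iter_objects(path):
    """Stream parsed JSON values from a file, one line at a time."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield from split_objects(line)
```

A loop such as `for obj in iter_objects('test.json'): ...` could then replace the `readlines()` loop from the question. Each failed parse re-reads the accumulated candidate, so lines with many $ characters inside string values will be slower to process.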

