Remove duplicate lines from multiple JSON files while preserving file structure

I have a folder with thousands of JSON files. The contents of the folder look something like this:

file.1424-417982.json file.1424-417995.json file.1424-418013.json file.1424-418015.json (etc.) 

Some of the files contain lines that are duplicated in other files in the folder. For example, the single line

{"a":"fas8d\U0001f638f8gej3","b":527239835} 

may occur in

file.1424-417982.json file.1424-418013.json 

or in some other files.

I would like to run a script that goes through all of the files, records which lines are duplicated in any of the files, then removes all the duplicate occurrences from the files (retaining the first occurrence).
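Something like the following GNU awk sketch is roughly what I have in mind (a sketch on my part, assuming gawk 4.1 or later, since -i inplace relies on the bundled inplace extension):

# Rewrite each file in place, keeping a line only the first time it is
# seen. The seen[] array lives for the whole invocation, so a line that
# already appeared in an earlier file is removed from later files.
gawk -i inplace '!seen[$0]++' *.json

This holds one array entry per unique line in memory and edits the files in place, so I'd want to run it on a copy of the folder first.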

I tried

sort -u *.json > newfile 

which produced a single massive file containing the unique lines from all of the files, but that isn't what I need: I'd like to preserve the existing file structure. Thanks for any tips!
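In case it helps frame an answer, a more portable two-pass sketch without gawk's in-place editing might look like the following; the .dedup suffix is an arbitrary placeholder I made up, and the originals are only overwritten by the final mv loop:

# Single awk pass over every file; seen[] is shared across files, so
# duplicates are detected folder-wide. Each input file gets its own
# deduplicated copy written next to it.
awk '
    FNR == 1 { if (out) close(out); out = FILENAME ".dedup"; printf "" > out }
    !seen[$0]++ { print > out }
' *.json

# Replace the originals once the copies look right. The -f test skips
# zero-length input files, which never get a .dedup copy.
for f in *.json; do
    [ -f "$f.dedup" ] && mv -- "$f.dedup" "$f"
done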