I have a massive file in which each line is unique. I have a collection of smaller files (still relatively large) where the lines are not unique, and this collection is constantly growing. I need to add the small files into the big file and make sure there are no duplicates in the big file. Right now what I do is concatenate all the files into one and then run sort -u on it. However, this rescans the entire big file each time, which takes longer and longer as more files come in and seems inefficient. Is there a better way to do this?
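Roughly, what I do now looks like this (file names are illustrative):

    # Current approach (illustrative names): append the new small files,
    # then re-sort and deduplicate the entire big file from scratch.
    cat small-*.txt >> bigfile.txt
    sort -u bigfile.txt -o bigfile.txt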
- Does each line have a smaller unique key? This sounds like you should be using a simple database, like dbm. – cdarke, Jul 23, 2018 at 6:52
- @cdarke Unfortunately I don't have a choice as to how the data is stored. – trallgorm, Jul 23, 2018 at 12:48
1 Answer
If the big file is already sorted, it would be more efficient to sort -u only the smaller files, and then sort -u -m (merge) the result with the big file. -m assumes the inputs are already individually sorted.
Example (untested):
    #!/bin/bash
    # Merges unique lines in the files passed as arguments into BIGFILE.
    BIGFILE=bigfile.txt
    TMPFILE=$(mktemp)
    trap "rm $TMPFILE" EXIT

    sort -u "$@" > "$TMPFILE"
    sort -um "$TMPFILE" "$BIGFILE" -o "$BIGFILE"

This answer explains why -o is necessary.
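For instance, assuming the script above is saved as merge.sh (the name is arbitrary), each new batch of small files can be folded in with:

    # Hypothetical invocation: fold two newly arrived small files into bigfile.txt
    ./merge.sh small1.txt small2.txt

Only the small files get a full sort; the big file is read once in a linear merge pass instead of being re-sorted, so each run is dominated by the size of the new batch.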
If you like process substitution, you can even do it as a one-liner:
    sort -um <(sort -u "$@") "$BIGFILE" -o "$BIGFILE"
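Note that both variants assume the big file is already sorted; if it isn't yet, a one-time full sort is needed first, for example:

    # One-time preparation, only needed if bigfile.txt is not already sorted
    sort -u bigfile.txt -o bigfile.txt

After that, every later run only has to sort the newly arrived small files.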
1 Comment

- Thanks, exactly what I was looking for. – trallgorm