I have a huge amount of data in which each (data-)line should be unique.

There are a lot of files in one folder in which this is already true. It is about 15 GB split into roughly 170 files with 1,000,000 lines each. Let's call that folder `foo`.

Now there is a second folder (`bar`) with even more data: within each file there are no duplicate entries, but the intersection of two files in `bar` is not necessarily empty. Each file there has roughly 15k lines (and there are several thousand files in `bar`).

Right now I'm using 

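    # first pass (foo/file): remember every line; second pass (bar/file): print only unseen lines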
    awk 'NR==FNR{a[$0];next} !($0 in a)' foo/file bar/file > tmp
    mv tmp bar/file

with nested loops over the files in `bar` and `foo`: for each file in `bar`, I loop over all files in `foo` and break out of that inner loop as soon as `bar/file` becomes empty. I have parallelized this with locking (so it can run on several nodes) and parallel execution on each node. But it still takes a heck of a long time.
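
In sketch form, the driver loop looks roughly like this (the glob patterns and the `"$b.tmp"` temp file are just for illustration; locking and the distribution across nodes are left out):

    for b in bar/*; do
        for f in foo/*; do
            # stop early once nothing is left to filter in this bar file
            [ -s "$b" ] || break
            awk 'NR==FNR{a[$0];next} !($0 in a)' "$f" "$b" > "$b.tmp" && mv "$b.tmp" "$b"
        done
    done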

What are possible ways to improve performance? What is the ideal size for the files in `foo`? Of course this is machine dependent (RAM/CPU/storage), but what is a good rule of thumb here?

**tl;dr**: `foo` contains unique data lines; `bar` contains data lines that can appear multiple times across `bar` and `foo`. Eliminate the duplicates in `bar` so that it can be merged with `foo`.
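
To make that concrete, here is a toy example (the `demo/` paths and contents are made up purely for illustration):

    mkdir -p demo/foo demo/bar                 # throw-away toy data
    printf 'aaa\nbbb\n' > demo/foo/f1          # foo: unique across all foo files
    printf 'aaa\nccc\n' > demo/bar/b1          # "aaa" already exists in foo
    printf 'ccc\nddd\n' > demo/bar/b2          # "ccc" also exists in bar/b1

    # Desired outcome: only "ccc" (kept once) and "ddd" survive in bar,
    # so they can be appended to foo without violating uniqueness.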

**[Update]** There are no empty lines. **[/Update]**