23

Is there some better solution for printing unique lines other than a combination of sort and uniq?

3
  • 1
    What do you mean by "better"? Commented Mar 23, 2011 at 13:31
  • @gabe Not requiring the entire file to be stored in memory for example. Commented Mar 23, 2011 at 13:46
  • Some versions of sort (eg. GNU coreutils) use temporary files and external mergesort if the input is too big to fit in RAM. And most other versions have a -m option so this can be done explicitly by chunking the input (eg. with split), sorting each chunk, and then merging the chunks Commented Jan 23, 2020 at 0:56
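The chunked approach from the last comment can be sketched like this (`big.txt`, the chunk prefix, and the chunk size are illustrative; tune `-l` to available memory):

```shell
# Split the input into fixed-size chunks (1,000,000 lines each here).
split -l 1000000 big.txt chunk.

# Sort each chunk individually; sort -o sorts a file in place.
for f in chunk.*; do sort -o "$f" "$f"; done

# Merge the pre-sorted chunks; -u drops duplicates during the merge.
sort -m -u chunk.* > unique.txt
rm chunk.*
```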

4 Answers 4

40

To print each identical line only once, in any order:

sort -u 

To print only the unique lines, in any order:

sort | uniq -u 
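To see the difference between the two (sample input is made up):

```shell
printf 'b\na\nb\nc\n' | sort -u         # a, b, c: every line, duplicates collapsed
printf 'b\na\nb\nc\n' | sort | uniq -u  # a, c: only the lines occurring exactly once
```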

To print each identical line only once, in the order of their first occurrence: (for each line, print the line if it hasn't been seen yet, then in any case increment the seen counter)

awk '!seen[$0] {print} {++seen[$0]}' 
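For example (made-up input), note that unlike `sort -u` the output preserves input order:

```shell
printf 'b\na\nb\nc\n' | awk '!seen[$0] {print} {++seen[$0]}'
# b
# a
# c
```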

To print only the unique lines, in the order of their first occurrence: (record each line in seen, and also in lines if it's the first occurrence; at the end of the input, print the lines in order of occurrence but only the ones seen only once)

awk '!seen[$0]++ {lines[n++]=$0} END {for (i=0; i<n; i++) if (seen[lines[i]]==1) print lines[i]}' 
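A quick check on a made-up input (the explicit index loop in the `END` block guarantees first-occurrence order, which awk's `for (i in lines)` form would not):

```shell
printf 'b\na\nb\nc\n' |
  awk '!seen[$0]++ {lines[n++]=$0}
       END {for (i=0; i<n; i++) if (seen[lines[i]]==1) print lines[i]}'
# a
# c
```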
4
  • 11
    how about awk '!seen[$0]++ {print}'? Commented Mar 23, 2011 at 3:26
  • 17
    Or even shorter awk '!seen[$0]++', since the {print} is implied by an empty command. Commented Jun 4, 2015 at 10:23
  • sort -u worked perfectly. however, sort | uniq -u was missing lines ! Commented Sep 9, 2020 at 11:32
  • 2
    @Chris sort | uniq -u only prints the unique lines. In other words, it removes all copies of duplicate lines. In contrast, sort -u or sort | uniq keeps a single copy of duplicate lines. Commented Sep 9, 2020 at 13:07
4

Some (most?) versions of sort have a -u flag that does the uniq part directly. There may be line-length restrictions depending on the implementation, but you had those already with plain sort | uniq.

4
  • 1
    Er? sort -u goes back to V7 at least. Commented Mar 22, 2011 at 22:46
  • Hum... I thought I remembered Solaris or AIX not having that. I'm wrong though, they both have it. Commented Mar 22, 2011 at 22:50
  • Solaris and AIX have -u but also have a 512-character line length restriction. (Actually, I think somewhere around Solaris 9 Sun upped it to 5120. GNU still wins, though.) Commented Mar 22, 2011 at 22:52
  • @geekosaur: are you sure? The work done to remove the 512-byte limit on line length in sort was documented in 'Theory and Practice in the Construction of a Working Sort Routine' by J. P. Linderman, Bell System Technical Journal, 63, 1827–1843 (1984). Commented Mar 23, 2011 at 3:32
1

For the last part of @Gilles's answer to this question (printing only the unique lines), I tried to eliminate the need for using two arrays.

This solution prints only the unique lines, though not necessarily in the order of their first occurrence:

awk '{counter[$0]++} END {for (line in counter) if (counter[line]==1) print line}'

Here, "counter" stores how many times each line has occurred. At the end, we print only those lines whose count is 1. Note that awk's for (line in counter) loop iterates in an unspecified order, so unlike the two-array version this does not preserve input order.
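Checked against a made-up input (piping through sort in the test only because the for-in iteration order is unspecified):

```shell
printf 'b\na\nb\nc\n' |
  awk '{counter[$0]++} END {for (line in counter) if (counter[line]==1) print line}'
# prints a and c, in an awk-implementation-defined order
```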

0

Does Perl work for you? It keeps the lines in the original order, even if the duplicates are not adjacent. You could also code the same idea in Python or awk.

while (<>) { print if $lines{$_}++ == 0; } 

Which can be shortened to just

perl -ne 'print unless $lines{$_}++;' 

Given input file:

abc
def
abc
ghi
abc
def
abc
ghi
jkl

It yields the output:

abc
def
ghi
jkl
3
  • Where is $lines getting defined? Commented Jul 20, 2014 at 3:09
  • It isn't. Since there isn't a use strict; or use warnings; (actually, it is strict that is most relevant here), there is no complaint about using %lines before it is defined. If run with strictures, there'd need to be a line my %lines; before the loop. Note, too, that the hash is %lines; one element of the hash is referenced using the $lines{$_} notation. Commented Jul 20, 2014 at 4:47
  • I think the sort solutions may be better for large amount of data (the OP was concerned about "storing the entire file in memory"). sort will perform an out-of-core sort if the data is larger than the available memory. Commented Apr 6, 2017 at 8:29
