Is there some better solution for printing unique lines other than a combination of sort and uniq?
4 Answers
To print each identical line only once, in any order:

    sort -u

To print only the unique lines (dropping every copy of a repeated line), in any order:

    sort | uniq -u

To print each identical line only once, in the order of its first occurrence (for each line, print the line if it hasn't been seen yet, then in any case increment the seen counter):

    awk '!seen[$0] {print} {++seen[$0]}'

To print only the unique lines, in the order of their first occurrence (record each line in seen, and also in lines if it's the first occurrence; at the end of the input, print the recorded lines in order, but only the ones seen exactly once — note the indexed loop, since awk's for-in traversal order is unspecified):

    awk '!seen[$0]++ {lines[n++]=$0} END {for (i=0; i<n; i++) if (seen[lines[i]]==1) print lines[i]}'

- How about `awk '!seen[$0]++ {print}'`? — asoundmove, Mar 23, 2011
- Or even shorter: `awk '!seen[$0]++'`, since the `{print}` is implied by an empty action. — quazgar, Jun 4, 2015
- `sort -u` worked perfectly; however, `sort | uniq -u` was missing lines! — Chris, Sep 9, 2020
- @Chris: `sort | uniq -u` only prints the unique lines. In other words, it removes all copies of duplicated lines. In contrast, `sort -u` (or `sort | uniq`) keeps a single copy of each duplicated line. — Gilles 'SO- stop being evil', Sep 9, 2020
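As a quick sanity check, here is how the four variants behave on a small made-up input (the file name is just for illustration):

```shell
# Input: "b" occurs twice; "a" and "c" occur once each.
printf 'b\na\nb\nc\n' > input.txt

sort -u input.txt            # each line once, sorted:       a b c
sort input.txt | uniq -u     # only non-repeated lines:      a c
awk '!seen[$0]++' input.txt  # each line once, input order:  b a c

# Only non-repeated lines, in input order: a c
awk '!seen[$0]++ {lines[n++]=$0}
     END {for (i=0; i<n; i++) if (seen[lines[i]]==1) print lines[i]}' input.txt
```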
Some (most?) versions of sort have a `-u` flag that does the uniq part directly. There might be line-length restrictions depending on the implementation, but you had those already with plain `sort | uniq`.
- Er? `sort -u` goes back to V7 at least. — geekosaur, Mar 22, 2011
- Hum... I thought I remembered Solaris or AIX not having that. I'm wrong, though; they both have it. — Mat, Mar 22, 2011
- Solaris and AIX have `-u`, but also have a 512-character line-length restriction. (Actually, I think somewhere around Solaris 9 Sun upped it to 5120. GNU still wins, though.) — geekosaur, Mar 22, 2011
- @geekosaur: are you sure? The work done to remove the 512-byte limit on line length in sort was documented in "Theory and Practice in the Construction of a Working Sort Routine" by J. P. Linderman, Bell System Technical Journal, 63, 1827-1843 (1984). — Jonathan Leffler, Mar 23, 2011
For the last case in @Gilles's answer to this question (printing only the unique lines), I tried to eliminate the need for two arrays:

    awk '{counter[$0]++} END {for (line in counter) if (counter[line]==1) print line}'

Here, `counter` stores a count of each distinct line seen so far. At the end, we print only those lines whose count is 1. Note one caveat: `for (line in counter)` iterates in an unspecified order in awk, so unlike the two-array version this does not guarantee first-occurrence order.
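A quick check on a made-up input file (since the for-in traversal order depends on the awk implementation, the output order may vary):

```shell
printf 'b\na\nb\nc\n' > data.txt

# Count every line; afterwards print the ones that occurred exactly once.
awk '{counter[$0]++} END {for (line in counter) if (counter[line]==1) print line}' data.txt
# prints "a" and "c" (in implementation-defined order)
```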
Does Perl work for you? It keeps the lines in their original order, even when the duplicates are not adjacent. You could also code the same idea in Python, or awk.

    while (<>) {
        print if $lines{$_}++ == 0;
    }

which can be shortened to just:

    perl -ne 'print unless $lines{$_}++;'

Given the input file:

    abc
    def
    abc
    ghi
    abc
    def
    abc
    ghi
    jkl

it yields the output:

    abc
    def
    ghi
    jkl

- Where is `$lines` getting defined? — Gregg Leventhal, Jul 20, 2014
- It isn't. Since there isn't a `use strict;` or `use warnings;` (actually, it is `strict` that is most relevant here), there is no complaint about using `%lines` before it is defined. If run with strictures, there'd need to be a line `my %lines;` before the loop. Note, too, that the hash is `%lines`; one element of the hash is referenced using the `$lines{$_}` notation. — Jonathan Leffler, Jul 20, 2014
- I think the `sort` solutions may be better for large amounts of data (the OP was concerned about "storing the entire file in memory"). `sort` will perform an out-of-core sort if the data is larger than the available memory. — Apr 6, 2017
Some `sort` implementations (e.g. GNU coreutils) use temporary files and an external merge sort if the input is too big to fit in RAM. Most other versions also have a `-m` option, so this can be done explicitly by chunking the input (e.g. with `split`), sorting each chunk, and then merging the sorted chunks.
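That explicit chunk-sort-merge pipeline can be sketched as follows (file and chunk names are invented for illustration, and a real run would use far larger chunks, e.g. `-l 1000000`):

```shell
# Stand-in for a file too large to sort in memory.
printf 'b\na\nb\nc\nd\na\n' > big.txt

split -l 2 big.txt chunk.                      # chunk the input (tiny chunks for the demo)
for f in chunk.*; do sort -o "$f" "$f"; done   # sort each chunk in place
sort -m -u chunk.* > unique-sorted.txt         # merge sorted chunks, dropping duplicates
rm chunk.*

cat unique-sorted.txt                          # a b c d, one per line
```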