Is there some better solution for printing unique lines other than a combination of sort and uniq?
4 Answers
To print each identical line only once, in any order:

    sort -u

To print only the unique lines (dropping every copy of a repeated line), in any order:

    sort | uniq -u

To print each identical line only once, in the order of its first occurrence (for each line, print the line if it hasn't been seen yet, then in any case increment the seen counter):

    awk '!seen[$0] {print} {++seen[$0]}'

To print only the unique lines, in the order of their first occurrence (record each line in seen, and also in lines if it's the first occurrence; at the end of the input, print the recorded lines in order, but only the ones seen exactly once — note the indexed loop, since awk's for-in traversal order is unspecified):

    awk '!seen[$0]++ {lines[n++]=$0} END {for (i=0; i<n; i++) if (seen[lines[i]]==1) print lines[i]}'

- How about `awk '!seen[$0]++ {print}'`? — asoundmove, Mar 23, 2011
- Or even shorter: `awk '!seen[$0]++'`, since the `{print}` is implied by an empty action. — quazgar, Jun 4, 2015
- `sort -u` worked perfectly; however, `sort | uniq -u` was missing lines! — Chris, Sep 9, 2020
- @Chris: `sort | uniq -u` only prints the unique lines. In other words, it removes all copies of duplicated lines. In contrast, `sort -u` (or `sort | uniq`) keeps a single copy of each duplicated line. — Gilles 'SO- stop being evil', Sep 9, 2020
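As a quick sanity check, here is how the four variants behave on a small made-up input (the file name is just for illustration):

```shell
# Input: "b" occurs twice; "a" and "c" occur once each.
printf 'b\na\nb\nc\n' > input.txt

sort -u input.txt            # each line once, sorted:       a b c
sort input.txt | uniq -u     # only non-repeated lines:      a c
awk '!seen[$0]++' input.txt  # each line once, input order:  b a c

# Only non-repeated lines, in input order: a c
awk '!seen[$0]++ {lines[n++]=$0}
     END {for (i=0; i<n; i++) if (seen[lines[i]]==1) print lines[i]}' input.txt
```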
Some (most?) versions of sort have a `-u` flag that does the uniq part directly. There might be line-length restrictions depending on the implementation, but you had those already with plain `sort | uniq`.
- Er? `sort -u` goes back to V7 at least. — geekosaur, Mar 22, 2011
- Hum... I thought I remembered Solaris or AIX not having that. I'm wrong, though; they both have it. — Mat, Mar 22, 2011
- Solaris and AIX have `-u`, but also have a 512-character line-length restriction. (Actually, I think somewhere around Solaris 9 Sun upped it to 5120. GNU still wins, though.) — geekosaur, Mar 22, 2011
- @geekosaur: are you sure? The work done to remove the 512-byte limit on line length in sort was documented in "Theory and Practice in the Construction of a Working Sort Routine" by J. P. Linderman, Bell System Technical Journal, 63, 1827-1843 (1984). — Jonathan Leffler, Mar 23, 2011
For the last case in @Gilles's answer to this question (printing only the unique lines), I tried to eliminate the need for two arrays:

    awk '{counter[$0]++} END {for (line in counter) if (counter[line]==1) print line}'

Here, `counter` stores a count of each distinct line seen so far. At the end, we print only those lines whose count is 1. Note one caveat: `for (line in counter)` iterates in an unspecified order in awk, so unlike the two-array version this does not guarantee first-occurrence order.
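A quick check on a made-up input file (since the for-in traversal order depends on the awk implementation, the output order may vary):

```shell
printf 'b\na\nb\nc\n' > data.txt

# Count every line; afterwards print the ones that occurred exactly once.
awk '{counter[$0]++} END {for (line in counter) if (counter[line]==1) print line}' data.txt
# prints "a" and "c" (in implementation-defined order)
```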
Does Perl work for you? It keeps the lines in their original order, even when the duplicates are not adjacent. You could also code the same idea in Python, or awk.

    while (<>) {
        print if $lines{$_}++ == 0;
    }

which can be shortened to just:

    perl -ne 'print unless $lines{$_}++;'

Given the input file:

    abc
    def
    abc
    ghi
    abc
    def
    abc
    ghi
    jkl

it yields the output:

    abc
    def
    ghi
    jkl

- Where is `$lines` getting defined? — Gregg Leventhal, Jul 20, 2014
- It isn't. Since there isn't a `use strict;` or `use warnings;` (actually, it is `strict` that is most relevant here), there is no complaint about using `%lines` before it is defined. If run with strictures, there'd need to be a line `my %lines;` before the loop. Note, too, that the hash is `%lines`; one element of the hash is referenced using the `$lines{$_}` notation. — Jonathan Leffler, Jul 20, 2014
- I think the `sort` solutions may be better for large amounts of data (the OP was concerned about "storing the entire file in memory"). `sort` will perform an out-of-core sort if the data is larger than the available memory. — Apr 6, 2017
Some `sort` implementations (e.g. GNU coreutils) use temporary files and an external merge sort if the input is too big to fit in RAM. Most other versions also have a `-m` option, so this can be done explicitly by chunking the input (e.g. with `split`), sorting each chunk, and then merging the sorted chunks.
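That explicit chunk-sort-merge pipeline can be sketched as follows (file and chunk names are invented for illustration, and a real run would use far larger chunks, e.g. `-l 1000000`):

```shell
# Stand-in for a file too large to sort in memory.
printf 'b\na\nb\nc\nd\na\n' > big.txt

split -l 2 big.txt chunk.                      # chunk the input (tiny chunks for the demo)
for f in chunk.*; do sort -o "$f" "$f"; done   # sort each chunk in place
sort -m -u chunk.* > unique-sorted.txt         # merge sorted chunks, dropping duplicates
rm chunk.*

cat unique-sorted.txt                          # a b c d, one per line
```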