
Is there a unix command that can check if any two lines in a file are the same?

For example, consider a file sentences.txt:

This is sentence X
This is sentence Y
This is sentence Z
This is sentence X
This is sentence A
This is sentence B

We see that the sentence

This is sentence X 

is repeated.

Is there any command that can quickly detect this, so that I could perhaps execute it like this:

$ cat sentences.txt | thecommand
Line 1:This is sentence X
Line 4:This is sentence X

3 Answers


Here is one way to get the exact output you're looking for:

$ grep -nFx "$(sort sentences.txt | uniq -d)" sentences.txt
1:This is sentence X
4:This is sentence X

Explanation:

The inner $(sort sentences.txt | uniq -d) lists each line that occurs more than once. The outer grep then searches sentences.txt again, printing every exact whole-line match (-x) against those fixed strings (-F) and prepending its line number (-n).

  • Your edit just barely beat me from posting the exact same answer. +1 Commented Feb 5, 2014 at 18:40
  • So the $(command) syntax works as a kind of replacement? Commented Feb 5, 2014 at 19:27
  • @CodeBlue - yes. It's called Command Substitution. Commented Feb 5, 2014 at 19:29
  • sort sentences.txt | uniq -d | grep -nFxf - sentences.txt would be a little more efficient and would avoid potential "argument list too long" problems. Commented Feb 6, 2014 at 9:34
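The variant from the last comment can be sketched as a runnable session (the printf line just recreates the example sentences.txt). Feeding the duplicate lines to grep on stdin via -f - means grep reads its patterns from a stream instead of the argument list:

```shell
# Recreate the example file from the question.
printf '%s\n' 'This is sentence X' 'This is sentence Y' \
              'This is sentence Z' 'This is sentence X' \
              'This is sentence A' 'This is sentence B' > sentences.txt

# -f - : read the patterns (the duplicated lines) from stdin.
sort sentences.txt | uniq -d | grep -nFxf - sentences.txt
# 1:This is sentence X
# 4:This is sentence X
```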

Not exactly what you want, but you can try combining sort and uniq -c -d:

aularon@aularon-laptop:~$ cat input
This is sentence X
This is sentence Y
This is sentence Z
This is sentence X
This is sentence A
This is sentence B
aularon@aularon-laptop:~$ sort input | uniq -cd
      2 This is sentence X
aularon@aularon-laptop:~$

The 2 here is the number of occurrences found for the line; from man uniq:

 -c, --count
        prefix lines by the number of occurrences
 -d, --repeated
        only print duplicate lines

If the file contents fit in memory, awk is good for this. The standard one-liner in comp.lang.awk (I can't search for an instance from this machine, but there are several every month) to just detect that there is duplication is awk 'n[$0]++'. It counts the occurrences of each line value and prints every occurrence other than the first, because the default action is print $0.
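The one-liner above can be demonstrated directly (the printf just recreates the example input). The pattern n[$0]++ is 0 (false) the first time a line is seen and nonzero (true) on every later occurrence, so only the repeats are printed:

```shell
# Print every occurrence of a line after the first one.
printf '%s\n' 'This is sentence X' 'This is sentence Y' \
              'This is sentence Z' 'This is sentence X' \
              'This is sentence A' 'This is sentence B' |
    awk 'n[$0]++'
# This is sentence X
```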

Showing all occurrences including the first, in your format (though possibly in mixed order when more than one value is duplicated), gets a little more finicky:

awk <sentences.txt '
    !($0 in n) { n[$0] = NR; next }                      # first sight: remember its line number
    n[$0]      { print "Line " n[$0] ":" $0; n[$0] = 0 } # second sight: print the first occurrence once
               { print "Line " NR ":" $0 }               # print this (repeated) occurrence
'

This is shown on multiple lines for clarity; you would usually run it together in real use. If you do this often you can put the awk script in a file and use awk -f, or of course put the whole thing in a shell script. Like most simple awk, this can be done very similarly with perl -n[a].
