
Suppose I have a file similar to the following:

123
123
234
234
123
345

I would like to find how many times '123' was duplicated, how many times '234' was duplicated, etc. So ideally, the output would be like:

123  3
234  2
345  1
1 Comment

What language do you want to use?

8 Answers


Assuming there is one number per line:

sort <file> | uniq -c 

You can use the more verbose --count flag too with the GNU version, e.g., on Linux:

sort <file> | uniq --count 
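For the sample file above, the counted output would look something like:

  3 123
  2 234
  1 345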

9 Comments

This is what I do, but algorithmically it doesn't seem to be the most efficient approach (O(n log n) * avg_line_len, where n is the number of lines). I'm working on files that are several gigabytes large, so performance is a key issue. I wonder whether there is a tool that does just the counting in a single pass using a prefix tree (in my case strings often have common prefixes) or similar, which should do the trick in O(n) * avg_line_len. Does anyone know such a command-line tool? (See the awk sketch after these comments.)
An additional step is to pipe the output of that into a final 'sort -n' command. That will sort the results by which lines occur most often.
If you want to only print duplicate lines, use 'uniq -d'
NOTE for anyone looking to get all duplicates: depending on your uniq version or platform, uniq -d might only output one copy per duplicated line. If you want all copies, look for options -D and/or --all-repeated.
If you want to sort the result again, you can use sort once more: sort <file> | uniq -c | sort -n
If @DmitrySandalov had not mentioned -d, I would have used … | uniq -c | grep -v '^\s*1' (-v means invert match, i.e. it rejects matching lines (not verbose, not version :))
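Regarding the single-pass question above: a minimal awk sketch (not a specialized prefix-tree tool) that counts in one pass, at the cost of keeping every distinct line in memory; yourfile here is a placeholder for the input:

awk '{count[$0]++} END {for (line in count) print count[line], line}' yourfile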

This will print duplicate lines only, with counts:

sort FILE | uniq -cd 

or, with GNU long options (on Linux):

sort FILE | uniq --count --repeated 

on BSD and OSX you have to use grep to filter out unique lines:

sort FILE | uniq -c | grep -v '^ *1 ' 

For the given example, the result would be:

  3 123
  2 234

If you want to print counts for all lines including those that appear only once:

sort FILE | uniq -c 

or, with GNU long options (on Linux):

sort FILE | uniq --count 

For the given input, the output is:

  3 123
  2 234
  1 345

In order to sort the output with the most frequent lines on top, you can do the following (to get all results):

sort FILE | uniq -c | sort -nr 

or, to get only duplicate lines, most frequent first:

sort FILE | uniq -cd | sort -nr 

on OSX and BSD the final one becomes:

sort FILE | uniq -c | grep -v '^ *1 ' | sort -nr 

10 Comments

Good point with the --repeated or -d option. So much more accurate than using "|grep 2" or similar!
How can I modify this command to retrieve all lines whose repetition count is more than 100?
@Black_Rider awk seems able to do all kinds of calculations; in your case you could do | awk '$1>100' (the full pipeline is shown after these comments)
@fionbio Looks like you can't use -c and -d together on OSX uniq. Thanks for pointing out. You can use grep to filter out unique lines: sort FILE | uniq -c | grep -v '^ *1 '
sort FILE | uniq -c | grep -v '^ *1 ' | sort -nr is beautiful!
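For reference, the full pipeline from the comment above, printing only lines whose count exceeds 100, would be something like:

sort FILE | uniq -c | awk '$1 > 100'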

To find and count duplicate lines in multiple files, you can try the following command:

sort <files> | uniq -c | sort -nr 

or:

cat <files> | sort | uniq -c | sort -nr 
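For example, with two hypothetical input files:

sort file1.txt file2.txt | uniq -c | sort -nr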

Comments


Via awk:

awk '{dups[$1]++} END{for (num in dups) {print num,dups[num]}}' data 

In the awk command, dups[$1]++ uses the variable $1, which holds the entire contents of column 1, and the square brackets are array access. So, for the first column of each line in the data file, the corresponding entry of the array named dups is incremented.

At the end, we loop over the dups array with num as the variable, printing each saved number first and then its duplicate count, dups[num].

Note that your input file has trailing spaces at the end of some lines; if you clean those up, you can use $0 in place of $1 in the command above :)
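A sketch of that whole-line variant, assuming the trailing spaces have been removed:

awk '{dups[$0]++} END{for (line in dups) {print line,dups[line]}}' data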

2 Comments

Isn't this a bit of overkill considering that we have uniq?
sort | uniq and the awk solution have quite different performance and resource trade-offs: if the files are large and the number of distinct lines is small, the awk solution is a lot more efficient. It is linear in the number of lines, and the space usage is linear in the number of distinct lines. OTOH, the awk solution needs to keep all the distinct lines in memory, while (GNU) sort can resort to temp files.

In Windows, using "Windows PowerShell", I used the command below to achieve this:

Get-Content .\file.txt | Group-Object | Select Name, Count 

Also, we can use the Where-Object cmdlet to filter the result:

Get-Content .\file.txt | Group-Object | Where-Object { $_.Count -gt 1 } | Select Name, Count 

2 Comments

can you delete all occurrences of the duplicates except the last one...without changing the sort order of the file?
Similarly to below, you can, of course, also sort, using ...| Sort -Top 15 -Descending Count | Select Name

To find duplicate counts, use this command:

sort filename | uniq -c | awk '{print $2, $1}' 
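For the question's sample data this swaps the columns so the value comes first, roughly:

123 3
234 2
345 1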

Comments


Assuming you've got access to a standard Unix shell and/or cygwin environment:

tr -s ' ' '\n' < yourfile | sort | uniq -d -c
       ^--space char

Basically: convert all space characters to linebreaks, then sort the translated output and feed that to uniq to count duplicate lines.
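For the sample data, the duplicates and their counts would come out something like:

  3 123
  2 234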

1 Comment

I guess this solution was tailored to a specific case of your own? i.e. you've got a list of words separated by spaces or newlines only. If it's only a list of numbers separated by newlines (no spaces) it will work fine there, but obviously your solution will treat lines containing spaces differently.

Using Raku (formerly known as Perl_6)

awk-like syntax:

~$ raku -ne 'BEGIN my %dups; %dups{$_}++; END for %dups.kv -> $k,$v {put $k => $v};' file

#OR:

~$ raku -ne 'BEGIN my %dups; %dups{$_}++; END put .key => .value for %dups;' file

Above are answers written in Raku, a member of the Perl-family of programming languages. Raku is Unicode-ready by default, and features a clean Regex syntax.

Here an awk-like syntax is invoked with the -ne (non-autoprinting, linewise) command-line flags. We BEGIN by declaring the %dups hash. The line gets loaded into the $_ topic variable, so in the body of the main loop %dups{$_}++ each line is added to the hash as key, with the (post-incremented) number-of-times seen as value. At the END of reading all lines, the %dups hash is output as (\t tab-separated) key/value pairs.

Sample Input:

123
123
234
234
123
345

Sample Output:

123	3
345	1
234	2

NOTE 1: The question asks for duplicates, but the OP's example output includes lines that have only been seen once (hence not duplicated). If you really want duplicates-only, add a clause such as if $v > 1 (added to end of first answer above) or if .value > 1 ( added in middle of second answer above).

NOTE 2: Sometimes text lines can be a little sloppy (leading/trailing whitespace, capitalization), yet you still want them counted as one key. In that case you can clean-up text using various Raku routines such as %dups{$_.trim.lc}++ in the main body of the loop.

https://raku.org

Comments
