How do I find the duplicates in a list of integers and create another list of the duplicates?
- Possible duplicate of "How do you remove duplicates from a list in Python whilst preserving order?" – DhruvPathak, Mar 23, 2012
- Do you want the duplicates once, or every time they are seen again? – moooeeeep, Mar 23, 2012
- I think this has been answered with much more efficiency here: stackoverflow.com/a/642919/1748045. intersection is a built-in method of set and should do exactly what is required. – Tom Smith, Oct 7, 2014
43 Answers
To remove duplicates use set(a). To print duplicates, something like:
a = [1,2,3,2,1,5,6,5,5,5]

import collections
print([item for item, count in collections.Counter(a).items() if count > 1])
## [1, 2, 5]

Note that Counter is not particularly efficient (timings) and probably overkill here. set will perform better. This code computes a list of unique elements in the source order:
seen = set()
uniq = []
for x in a:
    if x not in seen:
        uniq.append(x)
        seen.add(x)

or, more concisely:
seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]

I don't recommend the latter style, because it is not obvious what not seen.add(x) is doing (the set add() method always returns None, hence the need for not).
To compute the list of duplicated elements without libraries:
seen = set()
dupes = set()

for x in a:
    if x in seen:
        dupes.add(x)
    else:
        seen.add(x)

or, more concisely:
seen = set()
dupes = {x for x in a if x in seen or seen.add(x)}

If the list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic-time solution (compare each with each). For example:
a = [[1], [2], [3], [1], [5], [3]]

no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print(no_dupes)  # [[1], [2], [3], [5]]

dupes = [x for n, x in enumerate(a) if x in a[:n]]
print(dupes)  # [[1], [3]]
A very simple solution, but with complexity O(n*n):
>>> xs = [1,2,3,4,4,5,5,6,1]
>>> set([x for x in xs if xs.count(x) > 1])
set([1, 4, 5])
You don't need the count, just whether or not the item was seen before. Adapted from that answer to this problem:
def list_duplicates(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all others to seen_twice
    seen_twice = set(x for x in seq if x in seen or seen_add(x))
    # turn the set into a list (as requested)
    return list(seen_twice)

a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a)  # yields [1, 2, 5]

Just in case speed matters, here are some timings:
# file: test.py
import collections

def thg435(l):
    return [x for x, y in collections.Counter(l).items() if y > 1]

def moooeeeep(l):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all others to seen_twice
    seen_twice = set(x for x in l if x in seen or seen_add(x))
    # turn the set into a list (as requested)
    return list(seen_twice)

def RiteshKumar(l):
    return list(set([x for x in l if l.count(x) > 1]))

def JohnLaRooy(L):
    seen = set()
    seen2 = set()
    seen_add = seen.add
    seen2_add = seen2.add
    for item in L:
        if item in seen:
            seen2_add(item)
        else:
            seen_add(item)
    return list(seen2)

l = [1,2,3,2,1,5,6,5,5,5]*100

Here are the results (well done @JohnLaRooy!):
$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
10000 loops, best of 3: 74.6 usec per loop
$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 91.3 usec per loop
$ python -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 266 usec per loop
$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
100 loops, best of 3: 8.35 msec per loop

Interestingly, besides the timings themselves, the ranking also changes slightly when PyPy is used. Most interestingly, the Counter-based approach benefits hugely from PyPy's optimizations, whereas the method-caching approach I suggested seems to have almost no effect.
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
100000 loops, best of 3: 17.8 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
10000 loops, best of 3: 23 usec per loop
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 39.3 usec per loop

Apparently this effect is related to the "duplicatedness" of the input data. I set l = [random.randrange(1000000) for i in xrange(10000)] and got these results:
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
1000 loops, best of 3: 495 usec per loop
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
1000 loops, best of 3: 499 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 1.68 msec per loop
You can use iteration_utilities.duplicates:
>>> from iteration_utilities import duplicates
>>> list(duplicates([1,1,2,1,2,3,4,2]))
[1, 1, 2, 2]

or, if you only want one of each duplicate, this can be combined with iteration_utilities.unique_everseen:
>>> from iteration_utilities import unique_everseen
>>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
[1, 2]

It can also handle unhashable elements (however, at the cost of performance):
>>> list(duplicates([[1], [2], [1], [3], [1]]))
[[1], [1]]
>>> list(unique_everseen(duplicates([[1], [2], [1], [3], [1]])))
[[1]]

That's something that only a few of the other approaches here can handle.
Benchmarks
I did a quick benchmark containing most (but not all) of the approaches mentioned here.
The first benchmark included only a small range of list-lengths because some approaches have O(n**2) behavior.
In the graphs the y-axis represents the time, so a lower value means better. It's also plotted log-log so the wide range of values can be visualized better:
Removing the O(n**2) approaches, I did another benchmark up to half a million elements in a list:
As you can see, the iteration_utilities.duplicates approach is faster than any of the other approaches, and even chaining unique_everseen(duplicates(...)) was faster than or as fast as the other approaches.
One additional interesting thing to note here is that the pandas approaches are very slow for small lists but can easily compete for longer lists.
However, as these benchmarks show, most of the approaches perform roughly equally, so it doesn't matter much which one is used (except for the 3 that had O(n**2) runtime).
from iteration_utilities import duplicates, unique_everseen
from collections import Counter
import pandas as pd
import itertools

def georg_counter(it):
    return [item for item, count in Counter(it).items() if count > 1]

def georg_set(it):
    seen = set()
    uniq = []
    for x in it:
        if x not in seen:
            uniq.append(x)
            seen.add(x)

def georg_set2(it):
    seen = set()
    return [x for x in it if x not in seen and not seen.add(x)]

def georg_set3(it):
    seen = {}
    dupes = []
    for x in it:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1

def RiteshKumar_count(l):
    return set([x for x in l if l.count(x) > 1])

def moooeeeep(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all others to seen_twice
    seen_twice = set(x for x in seq if x in seen or seen_add(x))
    # turn the set into a list (as requested)
    return list(seen_twice)

def F1Rumors_implementation(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in zip(a, b):
        if k != g: continue
        if k != r: yield k
        r = k

def F1Rumors(c):
    return list(F1Rumors_implementation(c))

def Edward(a):
    d = {}
    for elem in a:
        if elem in d:
            d[elem] += 1
        else:
            d[elem] = 1
    return [x for x, y in d.items() if y > 1]

def wordsmith(a):
    return pd.Series(a)[pd.Series(a).duplicated()].values

def NikhilPrabhu(li):
    li = li.copy()
    for x in set(li):
        li.remove(x)
    return list(set(li))

def firelynx(a):
    vc = pd.Series(a).value_counts()
    return vc[vc > 1].index.tolist()

def HenryDev(myList):
    newList = set()
    for i in myList:
        if myList.count(i) >= 2:
            newList.add(i)
    return list(newList)

def yota(number_lst):
    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    return seen_set - duplicate_set

def IgorVishnevskiy(l):
    s = set(l)
    d = []
    for x in l:
        if x in s:
            s.remove(x)
        else:
            d.append(x)
    return d

def it_duplicates(l):
    return list(duplicates(l))

def it_unique_duplicates(l):
    return list(unique_everseen(duplicates(l)))

Benchmark 1
from simple_benchmark import benchmark
import random

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count,
    moooeeeep, F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx,
    HenryDev, yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}

b = benchmark(funcs, args, 'list size')
b.plot()

Benchmark 2
funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, moooeeeep,
    F1Rumors, Edward, wordsmith, firelynx, yota, IgorVishnevskiy,
    it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}

b = benchmark(funcs, args, 'list size')
b.plot()

Disclaimer
1 This is from a third-party library I have written: iteration_utilities.
I came across this question whilst looking into something related, and wonder why no one offered a generator-based solution? Solving this problem would be:
>>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
[1, 2, 5]

I was concerned with scalability, so I tested several approaches, including naive items that work well on small lists but scale horribly as lists get larger (note: it would have been better to use timeit, but this is illustrative).
I included @moooeeeep for comparison (it is impressively fast: fastest if the input list is completely random) and an itertools approach that is even faster again for mostly sorted lists... Now includes pandas approach from @firelynx -- slow, but not horribly so, and simple. Note - sort/tee/zip approach is consistently fastest on my machine for large mostly ordered lists, moooeeeep is fastest for shuffled lists, but your mileage may vary.
Advantages
- very quick
- simple to test for 'any' duplicates using the same code
Assumptions
- Duplicates should be reported once only
- Duplicate order does not need to be preserved
- Duplicate might be anywhere in the list
Fastest solution, 1m entries:
def getDupes(c):
    '''sort/tee/izip'''
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.izip(a, b):
        if k != g: continue
        if k != r: yield k
        r = k

Approaches tested
import itertools
import time
import random

def getDupes_1(c):
    '''naive'''
    for i in xrange(0, len(c)):
        if c[i] in c[:i]:
            yield c[i]

def getDupes_2(c):
    '''set len change'''
    s = set()
    for i in c:
        l = len(s)
        s.add(i)
        if len(s) == l:
            yield i

def getDupes_3(c):
    '''in dict'''
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

def getDupes_4(c):
    '''in set'''
    s, r = set(), set()
    for i in c:
        if i not in s:
            s.add(i)
        elif i not in r:
            r.add(i)
            yield i

def getDupes_5(c):
    '''sort/adjacent'''
    c = sorted(c)
    r = None
    for i in xrange(1, len(c)):
        if c[i] == c[i - 1]:
            if c[i] != r:
                yield c[i]
                r = c[i]

def getDupes_6(c):
    '''sort/groupby'''
    def multiple(x):
        try:
            x.next()
            x.next()
            return True
        except:
            return False
    for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))):
        yield k

def getDupes_7(c):
    '''sort/zip'''
    c = sorted(c)
    r = None
    for k, g in zip(c[:-1], c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_8(c):
    '''sort/izip'''
    c = sorted(c)
    r = None
    for k, g in itertools.izip(c[:-1], c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_9(c):
    '''sort/tee/izip'''
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.izip(a, b):
        if k != g: continue
        if k != r: yield k
        r = k

def getDupes_a(l):
    '''moooeeeep'''
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all others to seen_twice
    for x in l:
        if x in seen or seen_add(x):
            yield x

def getDupes_b(x):
    '''iter*/sorted'''
    x = sorted(x)
    def _matches():
        for k, g in itertools.izip(x[:-1], x[1:]):
            if k == g:
                yield k
    for k, n in itertools.groupby(_matches()):
        yield k

def getDupes_c(a):
    '''pandas'''
    import pandas as pd
    vc = pd.Series(a).value_counts()
    i = vc[vc > 1].index
    for _ in i:
        yield _

def hasDupes(fn, c):
    try:
        if fn(c).next(): return True    # Found a dupe
    except StopIteration:
        pass
    return False

def getDupes(fn, c):
    return list(fn(c))

STABLE = True
if STABLE:
    print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
else:
    print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'

for location in (50, 250000, 500000, 750000, 999999):
    for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
                 getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
        print 'Test %-15s:%10d - ' % (test.__doc__ or test.__name__, location),
        deltas = []
        for FIRST in (True, False):
            for i in xrange(0, 5):
                c = range(0, 1000000)
                if STABLE:
                    c[0] = location
                else:
                    c.append(location)
                    random.shuffle(c)
                start = time.time()
                if FIRST:
                    print '.' if location == test(c).next() else '!',
                else:
                    print '.' if [location] == list(test(c)) else '!',
                deltas.append(time.time() - start)
            print ' -- %0.3f ' % (sum(deltas) / len(deltas)),
        print
    print

The results for the 'all dupes' test were consistent, finding "first" duplicate then "all" duplicates in this array:
Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array

Test set len change :    500000 -  . . . . .  -- 0.264  . . . . .  -- 0.402
Test in dict        :    500000 -  . . . . .  -- 0.163  . . . . .  -- 0.250
Test in set         :    500000 -  . . . . .  -- 0.163  . . . . .  -- 0.249
Test sort/adjacent  :    500000 -  . . . . .  -- 0.159  . . . . .  -- 0.229
Test sort/groupby   :    500000 -  . . . . .  -- 0.860  . . . . .  -- 1.286
Test sort/izip      :    500000 -  . . . . .  -- 0.165  . . . . .  -- 0.229
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.145  . . . . .  -- 0.206  *
Test moooeeeep      :    500000 -  . . . . .  -- 0.149  . . . . .  -- 0.232
Test iter*/sorted   :    500000 -  . . . . .  -- 0.160  . . . . .  -- 0.221
Test pandas         :    500000 -  . . . . .  -- 0.493  . . . . .  -- 0.499

When the lists are shuffled first, the price of the sort becomes apparent: efficiency drops noticeably and the @moooeeeep approach dominates, with the set and dict approaches being similar but lesser performers:
Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array

Test set len change :    500000 -  . . . . .  -- 0.321  . . . . .  -- 0.473
Test in dict        :    500000 -  . . . . .  -- 0.285  . . . . .  -- 0.360
Test in set         :    500000 -  . . . . .  -- 0.309  . . . . .  -- 0.365
Test sort/adjacent  :    500000 -  . . . . .  -- 0.756  . . . . .  -- 0.823
Test sort/groupby   :    500000 -  . . . . .  -- 1.459  . . . . .  -- 1.896
Test sort/izip      :    500000 -  . . . . .  -- 0.786  . . . . .  -- 0.845
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743  . . . . .  -- 0.804
Test moooeeeep      :    500000 -  . . . . .  -- 0.234  . . . . .  -- 0.311  *
Test iter*/sorted   :    500000 -  . . . . .  -- 0.776  . . . . .  -- 0.840
Test pandas         :    500000 -  . . . . .  -- 0.539  . . . . .  -- 0.540
Here's a neat and concise solution:
for x in set(li):
    li.remove(x)

li = list(set(li))
Python 3.8 one-liner if you don't care to write your own algorithm or use libraries:
from itertools import groupby

l = [1,2,3,2,1,5,6,5,5,5]
res = [(x, count) for x, g in groupby(sorted(l)) if (count := len(list(g))) > 1]
print(res)

Prints item and count:
[(1, 2), (2, 2), (5, 4)]

groupby takes a grouping function, so you can define your groupings in different ways and return additional tuple fields as needed.
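For instance, a minimal sketch of that idea (the case-insensitive key and the words list are illustrative assumptions, not part of the answer above):

from itertools import groupby

# Hypothetical example: group case-insensitively (str.lower as the key)
# and report how often each duplicated word appears.
words = ["Apple", "apple", "Banana", "cherry", "Cherry", "cherry"]
res = [(key, count)
       for key, g in groupby(sorted(words, key=str.lower), key=str.lower)
       if (count := len(list(g))) > 1]
print(res)  # [('apple', 2), ('cherry', 3)]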
I guess the most effective way to find duplicates in a list is:
from collections import Counter

def duplicates(values):
    dups = Counter(values) - Counter(set(values))
    return list(dups.keys())

print(duplicates([1,2,3,6,5,2]))

It uses Counter once on all the elements, and then on all unique elements. Subtracting the second from the first leaves only the duplicates.
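To make the subtraction concrete, here is an illustrative trace (the example list is assumed); Counter discards non-positive counts when subtracting, so only keys that appeared more than once survive:

from collections import Counter

values = [1, 2, 3, 6, 5, 2]
print(Counter(values))                          # Counter({2: 2, 1: 1, 3: 1, 6: 1, 5: 1})
print(Counter(set(values)))                     # every unique key counted once
print(Counter(values) - Counter(set(values)))   # Counter({2: 1}) -> only the duplicates remain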
collections.Counter is new in Python 2.7:
Python 2.5.4 (r254:67916, May 31 2010, 15:03:39)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = [1,2,3,2,1,5,6,5,5,5]
>>> import collections
>>> print [x for x, y in collections.Counter(a).items() if y > 1]
  File "", line 1, in
AttributeError: 'module' object has no attribute 'Counter'
>>>

In an earlier version you can use a conventional dict instead:
a = [1,2,3,2,1,5,6,5,5,5]
d = {}
for elem in a:
    if elem in d:
        d[elem] += 1
    else:
        d[elem] = 1

print [x for x, y in d.items() if y > 1]
I would do this with pandas, because I use pandas a lot
import pandas as pd

a = [1,2,3,3,3,4,5,6,6,7]
vc = pd.Series(a).value_counts()
vc[vc > 1].index.tolist()

Gives
[3, 6]

It probably isn't very efficient, but it is certainly less code than a lot of the other answers, so I thought I would contribute.
How about simply looping through each element in the list, checking the number of occurrences, and adding those seen more than once to a set, which then holds the duplicates? Hope this helps someone out there.
myList = [2, 4, 6, 8, 4, 6, 12]
newList = set()

for i in myList:
    if myList.count(i) >= 2:
        newList.add(i)

print(list(newList))
## [4, 6]
We can use itertools.groupby in order to find all the items that have dups:
from itertools import groupby

myList = [2, 4, 6, 8, 4, 6, 12]
# when the list is sorted, groupby groups by consecutive elements which are similar
for x, y in groupby(sorted(myList)):
    # list(y) returns all the occurrences of item x
    if len(list(y)) > 1:
        print x

The output will be:
4
6
Without converting to a list, probably the simplest way would be something like below. This may be useful during an interview when they ask not to use sets:
a = [1,2,3,3,3]
dup = []
for each in a:
    if each not in dup:
        dup.append(each)
print(dup)

Or, to get 2 separate lists of unique values and duplicate values:
a = [1,2,3,3,3]
uniques = []
dups = []

for each in a:
    if each not in uniques:
        uniques.append(each)
    else:
        dups.append(each)

print("Unique values are below:")
print(uniques)
print("Duplicate values are below:")
print(dups)
This seems somewhat competitive despite its O(n log n) complexity, see benchmarks below.
a = sorted(a)
dupes = list(set(a[::2]) & set(a[1::2]))

Sorting brings duplicates next to each other, so they're both at an even index and at an odd index. Unique values are only at an even or at an odd index, not both. So the intersection of even-index values and odd-index values is the duplicates.
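A short worked trace under an assumed input may help show why this works:

# Illustrative trace (input list assumed): after sorting, every duplicated value
# occupies at least two adjacent positions, so it appears in both slices.
a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
a = sorted(a)              # [1, 1, 2, 2, 3, 5, 5, 5, 5, 6]
even = set(a[::2])         # values at indices 0, 2, 4, ... -> {1, 2, 3, 5}
odd = set(a[1::2])         # values at indices 1, 3, 5, ... -> {1, 2, 5, 6}
print(sorted(even & odd))  # [1, 2, 5]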
This uses MSeifert's benchmark, but only with the solutions from the accepted answer (the georgs), the slowest solutions, the fastest solution (excluding it_duplicates as it doesn't uniquify the duplicates), and mine. Otherwise it'd be too crowded and the colors too similar.
First line could be a.sort() if we're allowed to modify the given list, that would be a bit faster. But the benchmark re-uses the same list multiple times, so modifying it would mess with the benchmark.
And apparently set(a[::2]).intersection(a[1::2]) wouldn't create a second set and be a bit faster, but meh, it's also a bit longer.
The third example of the accepted answer gives an erroneous answer and does not attempt to give duplicates. Here is the correct version:
number_lst = [1, 1, 2, 3, 5, ...]

seen_set = set()
duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set
Some of these solutions are either quadratic complexity, verbose, or require third-party libraries.
Here's a simple, 2-line answer using the "seen set" strategy that keeps time complexity linear while only doubling memory.
def duplicates(array):
    seen = set()
    return {val for val in array if (val in seen or seen.add(val))}

duplicates(["a", "b", "c", "a", "d"])
# > {'a'}

How it works
- val in seen will be True if seen contains the value, so the element is included in the returned comprehension.
- Otherwise, seen.add(val) adds the element to the seen set (without including it in the result, because add() returns None, which evaluates as False in this context).
- Everything is wrapped in a set comprehension that ensures we only return each duplicate element once.
A very simple and quick way of finding dupes with one iteration in Python is:
testList = ['red', 'blue', 'red', 'green', 'blue', 'blue']

testListDict = {}

for item in testList:
    try:
        testListDict[item] += 1
    except:
        testListDict[item] = 1

print testListDict

Output will be as follows:
>>> print testListDict
{'blue': 3, 'green': 1, 'red': 2}

This and more in my blog http://www.howtoprogramwithpython.com
I am entering this discussion much, much later. Even so, I would like to deal with this problem with one-liners, because that's the charm of Python. If we just want to get the duplicates into a separate list (or any collection), I would suggest doing as below. Say we have a list with duplicates, which we can call 'target':
target = [1,2,3,4,4,4,3,5,6,8,4,3]

Now if we want to get the duplicates, we can use the one-liner below:
duplicates = dict(set((x, target.count(x)) for x in filter(lambda rec: target.count(rec) > 1, target)))

This code will put the duplicated records as keys and their counts as values into the dictionary 'duplicates'. The 'duplicates' dictionary will look like below:
{3: 3, 4: 4}  # saying 3 is repeated 3 times and 4 is repeated 4 times

If you just want all the records with duplicates alone in a list, it's again much shorter code:
duplicates = filter(lambda rec: target.count(rec) > 1, target)

Output will be:
[3, 4, 4, 4, 3, 4, 3]

This works perfectly in Python 2.7.x and later versions.
Method 1:
list(set([val for idx, val in enumerate(input_list) if val in input_list[idx+1:]]))

Explanation: [val for idx, val in enumerate(input_list) if val in input_list[idx+1:]] is a list comprehension that returns an element if the same element is present anywhere later in the list (i.e., from the position after its current index to the end).
Example: input_list = [42,31,42,31,3,31,31,5,6,6,6,6,6,7,42]
Starting with the first element in the list, 42, at index 0, it checks whether the element 42 is present in input_list[1:] (i.e., from index 1 to the end of the list). Because 42 is present in input_list[1:], it will return 42.

Then it goes to the next element, 31, at index 1, and checks whether element 31 is present in input_list[2:] (i.e., from index 2 to the end of the list). Because 31 is present in input_list[2:], it will return 31.

Similarly, it goes through all the elements in the list and returns only the repeated/duplicate elements into a list.

Then, because we have duplicates in that list, we need to pick one of each duplicate, i.e. remove the duplicates among the duplicates. To do so, we call the Python built-in set(), which removes the duplicates.

We are then left with a set, not a list, so to convert it back we use list(), which converts the set of elements to a list.
Method 2:
def dupes(ilist):
    temp_list = []  # initially, empty temporary list
    dupe_list = []  # initially, empty duplicate list
    for each in ilist:
        if each in temp_list:          # Found a duplicate element
            if not each in dupe_list:  # Avoid duplicate elements in dupe_list
                dupe_list.append(each) # Add duplicate element to dupe_list
        else:
            temp_list.append(each)     # Add a new (non-duplicate) to temp_list
    return dupe_list

Explanation: Here we create two empty lists to start with, then keep traversing through all the elements of the list to see whether each exists in temp_list (initially empty). If it is not in temp_list, we add it to temp_list using the append method.
If it already exists in temp_list, it means that the current element of the list is a duplicate, and hence we need to add it to dupe_list using the append method.
Some other tests. Of course, doing...

set([x for x in l if l.count(x) > 1])

...is too costly. It's about 500 times faster (the longer the array, the better the results) to use the following final method:
def dups_count_dict(l):
    d = {}
    for item in l:
        if item not in d:
            d[item] = 0
        d[item] += 1
    result_d = {key: val for key, val in d.iteritems() if val > 1}
    return result_d.keys()

Only 2 loops, no very costly l.count() operations.
Here is some code to compare the methods. The code is below; here is the output:
dups_count: 13.368s          # this is a function which uses l.count()
dups_count_dict: 0.014s      # this is the final best function (of the 3 functions)
dups_count_counter: 0.024s   # collections.Counter

The testing code:
import numpy as np
from time import time
from collections import Counter

class TimerCounter(object):
    def __init__(self):
        self._time_sum = 0

    def start(self):
        self.time = time()

    def stop(self):
        self._time_sum += time() - self.time

    def get_time_sum(self):
        return self._time_sum


def dups_count(l):
    return set([x for x in l if l.count(x) > 1])


def dups_count_dict(l):
    d = {}
    for item in l:
        if item not in d:
            d[item] = 0
        d[item] += 1
    result_d = {key: val for key, val in d.iteritems() if val > 1}
    return result_d.keys()


def dups_counter(l):
    counter = Counter(l)
    result_d = {key: val for key, val in counter.iteritems() if val > 1}
    return result_d.keys()


def gen_array():
    np.random.seed(17)
    return list(np.random.randint(0, 5000, 10000))


def assert_equal_results(*results):
    primary_result = results[0]
    other_results = results[1:]
    for other_result in other_results:
        assert set(primary_result) == set(other_result) and len(primary_result) == len(other_result)


if __name__ == '__main__':
    dups_count_time = TimerCounter()
    dups_count_dict_time = TimerCounter()
    dups_count_counter = TimerCounter()

    l = gen_array()

    for i in range(3):
        dups_count_time.start()
        result1 = dups_count(l)
        dups_count_time.stop()

        dups_count_dict_time.start()
        result2 = dups_count_dict(l)
        dups_count_dict_time.stop()

        dups_count_counter.start()
        result3 = dups_counter(l)
        dups_count_counter.stop()

    assert_equal_results(result1, result2, result3)

    print 'dups_count: %.3f' % dups_count_time.get_time_sum()
    print 'dups_count_dict: %.3f' % dups_count_dict_time.get_time_sum()
    print 'dups_count_counter: %.3f' % dups_count_counter.get_time_sum()
raw_list = [1,2,3,3,4,5,6,6,7,2,3,4,2,3,4,1,3,4]
clean_list = list(set(raw_list))
duplicated_items = []

for item in raw_list:
    try:
        clean_list.remove(item)
    except ValueError:
        duplicated_items.append(item)

print(duplicated_items)
# [3, 6, 2, 3, 4, 2, 3, 4, 1, 3, 4]

You basically remove duplicates by converting to a set (clean_list), then iterate over raw_list, removing each item from the clean list as it occurs in raw_list. If the item is not found, the raised ValueError exception is caught and the item is added to the duplicated_items list.
If the index of the duplicated items is needed, just enumerate the list and play around with the index (for index, item in enumerate(raw_list):), which is faster and better suited for large lists (thousands of elements or more); a rough sketch is shown below.
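A minimal sketch of that idea; note it uses a seen set rather than the remove-based approach above, and the variable names are assumptions:

raw_list = [1, 2, 3, 3, 4, 5, 6, 6, 7, 2, 3, 4, 2, 3, 4, 1, 3, 4]

seen = set()
dup_indices = []
for index, item in enumerate(raw_list):
    if item in seen:               # item was seen before -> this index holds a duplicate
        dup_indices.append(index)
    else:
        seen.add(item)

print(dup_indices)  # [3, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17]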
Use the list.count() method to find the duplicate elements of a given list:
arr = []
dup = []

for i in range(int(input("Enter range of list: "))):
    arr.append(int(input("Enter Element in a list: ")))

for i in arr:
    if arr.count(i) > 1 and i not in dup:
        dup.append(i)

print(dup)
There are a lot of answers up here, but I think this is a relatively readable and easy-to-understand approach:
def get_duplicates(sorted_list):
    duplicates = []
    last = sorted_list[0]
    for x in sorted_list[1:]:
        if x == last:
            duplicates.append(x)
        last = x
    return set(duplicates)

Notes:
- If you wish to preserve duplication count, get rid of the cast to 'set' at the bottom to get the full list
- If you prefer to use generators, replace duplicates.append(x) with yield x and the return statement at the bottom (you can cast to set later), as sketched after this list
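A minimal sketch of the generator variant described in the second note (the function name and the final set() cast are assumptions, not part of the original answer):

def gen_duplicates(sorted_list):
    # yields a value each time it equals its predecessor in the sorted input
    last = sorted_list[0]
    for x in sorted_list[1:]:
        if x == last:
            yield x
        last = x

print(set(gen_duplicates(sorted([1, 2, 3, 2, 1, 5, 6, 5, 5, 5]))))  # {1, 2, 5}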
Here's a fast generator that uses a dict to store each element as a key with a boolean value for checking if the duplicate item has already been yielded.
For lists with all elements that are hashable types:
def gen_dupes(array):
    unique = {}
    for value in array:
        if value in unique and unique[value]:
            unique[value] = False
            yield value
        else:
            unique[value] = True

array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, 1, 6]

For lists that might contain lists:
def gen_dupes(array):
    unique = {}
    for value in array:
        is_list = False
        if type(value) is list:
            value = tuple(value)
            is_list = True

        if value in unique and unique[value]:
            unique[value] = False
            if is_list:
                value = list(value)
            yield value
        else:
            unique[value] = True

array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, [1, 2], 6]
When using toolz:
from toolz import frequencies, valfilter

a = [1,2,2,3,4,5,4]

>>> list(valfilter(lambda count: count > 1, frequencies(a)).keys())
[2, 4]

