40

If I have a python list that is has many duplicates, and I want to iterate through each item, but not through the duplicates, is it best to use a set (as in set(mylist), or find another way to create a list without duplicates? I was thinking of just looping through the list and checking for duplicates but I figured that's what set() does when it's initialized.

So if mylist = [3,1,5,2,4,4,1,4,2,5,1,3] and I really just want to loop through [1,2,3,4,5] (order doesn't matter), should I use set(mylist) or something else?

An alternative is possible in the last example, since the list contains every integer between its min and max value, I could loop through range(min(mylist),max(mylist)) or through set(mylist). Should I generally try to avoid using set in this case? Also, would finding the min and max be slower than just creating the set?


In the case in the last example, the set is faster:

from numpy.random import random_integers ids = random_integers(1e3,size=1e6) def set_loop(mylist): idlist = [] for id in set(mylist): idlist.append(id) return idlist def list_loop(mylist): idlist = [] for id in range(min(mylist),max(mylist)): idlist.append(id) return idlist %timeit set_loop(ids) #1 loops, best of 3: 232 ms per loop %timeit list_loop(ids) #1 loops, best of 3: 408 ms per loop 
8
  • Do you expect this speed difference to actually matter in any program you ever write? Keeping things in numpy, using a genexp instead of building up a million-element list just to iterate over (and using xrange instead of range if this is Py2), trying to do tight loops in C instead of Python (e.g., idlist = range(…) instead of a for loop that does the same thing), etc. will all make orders of magnitude more difference. Commented Feb 27, 2013 at 1:56
  • 2
    More specifically: the whole body of set_loop is equivalent to return list(set(mylist)), and list_loop to return range(min(mylist), max(mylist)) in 2.x or return list(range(min(mylist), max(mylist))) in 3.x. The simpler versions may or may not be significantly faster—but they'll never be slower, and they're a lot easier to read. Commented Feb 27, 2013 at 1:58
  • Do you care if the list remains in the same order after removing the dups? Commented Feb 27, 2013 at 2:06
  • @thewolf Order doesn't matter, which is why I am considering a set. Commented Feb 27, 2013 at 19:12
  • 2
    @askewchan: Really, you're better off writing the most readable thing first. If you want something with the semantics of a set, use a set. If the program turns out to be slow, and profiling shows you that building or using that set is relevant, then you can look into faster solutions. But if you start off asking the fastest way to do each individual step within your program… well, you should be writing in assembly, not Python. Commented Feb 27, 2013 at 19:19

6 Answers 6

39

Just use a set. Its semantics are exactly what you want: a collection of unique items.

Technically you'll be iterating through the list twice: once to create the set, once for your actual loop. But you'd be doing just as much work or more with any other approach.

Sign up to request clarification or add additional context in comments.

Comments

10

set is what you want, so you should use set. Trying to be clever introduces subtle bugs like forgetting to add one tomax(mylist)! Code defensively. Worry about what's faster when you determine that it is too slow.

range(min(mylist), max(mylist) + 1) # <-- don't forget to add 1 

1 Comment

i would love to here your opinion on my answer is it fast whe we deal with large list.
8

While a set may be what you want structure-wise, the question is what is faster. A list is faster. Your example code doesn't accurately compare set vs list because you're converting from a list to a set in set_loop, and then you're creating the list you'll be looping through in list_loop. The set and list you iterate through should be constructed and in memory ahead of time, and simply looped through to see which data structure is faster at iterating:

ids_list = range(1000000) ids_set = set(ids) def f(x): for i in x: pass %timeit f(ids_set) #1 loops, best of 3: 214 ms per loop %timeit f(ids_list) #1 loops, best of 3: 176 ms per loop 

2 Comments

your ids_list is not an actual list, it is an range object but your result is correct, the set does need longer than the list. and the range is even slower than the set. so when > is equivalent to faster then list > set > range
That depends on what version of python you're using. In Python 2, range() returns a list object. In Python 3, it returns a range generator object. Good to know the timing differences for python 3, though: a range object is good for memory, but bad for speed.
6

For simplicity's sake: newList = list(set(oldList))

But there are better options out there if you'd like to get speed/ordering/optimization instead: http://www.peterbe.com/plog/uniqifiers-benchmark

2 Comments

There is no good reason to go back to a list. He already lost the element order when converting it to a set so there's no reason for not staying with the set.
@ThiefMaster There are reasons for wanting to go back to a list, mainly performance. Lists are much faster for iteration than a set and by keeping an internal attribute for each element you can easily convert back to a list and sort it into the proper order.
2

I the list is vary large looping two time over it will take a lot of time and more in the second time you are looping a set not a list and as we know iterating over a set is slower than list.

i think you need the power of generator and set.

def first_test(): def loop_one_time(my_list): # create a set to keep the items. iterated_items = set() # as we know iterating over list is faster then list. for value in my_list: # as we know checking if element exist in set is very fast not # metter the size of the set. if value not in iterated_items: iterated_items.add(value) # add this item to list yield value mylist = [3,1,5,2,4,4,1,4,2,5,1,3] for v in loop_one_time(mylist):pass def second_test(): mylist = [3,1,5,2,4,4,1,4,2,5,1,3] s = set(mylist) for v in s:pass import timeit print(timeit.timeit('first_test()', setup='from __main__ import first_test', number=10000)) print(timeit.timeit('second_test()', setup='from __main__ import second_test', number=10000)) 

out put:

 0.024003583388435043 0.010424674188938422 

Note: this technique order is guaranteed

Comments

0

Go Back To Where The List Is Defined

If the list is part of a larger codebase and needs to be as is then all these responses are right you simply have to convert the list into a set with either the set() function or a set comprehnsion { item for item in list }. However if this is YOUR code base and you know that the list is only going to be used in THIS context you can go back to where the list is defined and define it as a set instead of a list. This means that you have a smaller dataset to begin with and would indeed be faster than taking a list, iterating through it, and then creating a set out of that.

Scope Matters

The other thing to consider here is whether or not that the dataset can be defined as a set comprehension as described above and if not then the scope of where the dataset needs to be a list and where it needs to be a set.

If you were to for example call set() on the list inside a loop. Then every time the loop would be called the list would be iterated through and a set would be created. If instead you created the set outside of the scope of the loop you would only iterate through the list once and create one set.

Set Comprehension vs set()

set() - Takes an iterable argument then returns a set version of that argument while not mutating the original. This means you end up with two datasets instead of one. The iterable that was passed to set() and the new set that is returned from calling set(). This is useful if you need to convert a predefined iterable to a set to get rid of duplicate values for your use but leave the original dataset for others to use as is.

{item for item in thing} - The set comprehension creates a set then executes a for loop filling that set with items from that loop. This is useful if you want to define a set from scratch. This is especially true if you need to modify each item in the loop before adding it to the set, check a conditional before adding the item to the set, or both. For example: cube_even_vals = {num**3 for num in thing if num % 2 ==0} loops through an iterable called thing and if the item is even cubes that value, and then adds it to the set.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.