If I have a Python list that has many duplicates, and I want to iterate through each item but not through the duplicates, is it best to use a set (as in set(mylist)) or to find another way to create a list without duplicates? I was thinking of just looping through the list and checking for duplicates, but I figured that's what set() does when it's initialized.
So if mylist = [3,1,5,2,4,4,1,4,2,5,1,3] and I really just want to loop through [1,2,3,4,5] (order doesn't matter), should I use set(mylist) or something else?
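As a minimal sketch of what looping over the unique items could look like, assuming the elements are hashable: set(mylist) drops duplicates in a single pass. The dict.fromkeys variant is not part of the question; it is shown only as an option for when first-seen order matters (it relies on dicts preserving insertion order, Python 3.7+).

```python
mylist = [3, 1, 5, 2, 4, 4, 1, 4, 2, 5, 1, 3]

# set() removes duplicates; iteration order is unspecified
unique_items = set(mylist)
visited = []
for item in unique_items:
    visited.append(item)  # each distinct value is seen exactly once

# Alternative (not from the question): keep first-occurrence order
ordered_unique = list(dict.fromkeys(mylist))
```

Since order does not matter here, the plain set is the simpler choice.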
An alternative is possible in the last example: since the list contains every integer between its min and max value, I could loop through range(min(mylist), max(mylist) + 1) instead of set(mylist). Should I generally try to avoid using set in this case? Also, would finding the min and max be slower than just creating the set?
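A small sketch of why the range approach only works when the list really contains every integer between its min and max. The gappy list is a made-up example, not from the question. Note that range() excludes its stop value, so max(...) + 1 is needed to include the largest item.

```python
mylist = [3, 1, 5, 2, 4, 4, 1, 4, 2, 5, 1, 3]

# Contiguous values: the range and the set cover the same items
from_range = list(range(min(mylist), max(mylist) + 1))
from_set = sorted(set(mylist))

# Hypothetical list with gaps: the range invents values that were never present
gappy = [1, 2, 5, 5, 1]
gappy_range = list(range(min(gappy), max(gappy) + 1))
gappy_set = sorted(set(gappy))
```

Here from_range and from_set agree, but gappy_range contains 3 and 4, which never appear in gappy. min() and max() also each make a full pass over the list, so the range version scans the data twice.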
In the case in the last example, the set is faster:
    from numpy.random import randint

    # random_integers is deprecated; randint(1, 1001) draws the same 1..1000 range
    ids = randint(1, 1001, size=10**6)

    def set_loop(mylist):
        idlist = []
        for id in set(mylist):
            idlist.append(id)
        return idlist

    def list_loop(mylist):
        idlist = []
        # range() excludes its stop value, so max(mylist) itself is not visited
        for id in range(min(mylist), max(mylist)):
            idlist.append(id)
        return idlist

    %timeit set_loop(ids)
    # 1 loops, best of 3: 232 ms per loop
    %timeit list_loop(ids)
    # 1 loops, best of 3: 408 ms per loop
… numpy, using a genexp instead of building up a million-element list just to iterate over (and using xrange instead of range if this is Py2), trying to do tight loops in C instead of Python (e.g., idlist = range(…) instead of a for loop that does the same thing), etc. will all make orders of magnitude more difference.
set_loop is equivalent to return list(set(mylist)), and list_loop to return range(min(mylist), max(mylist)) in 2.x or return list(range(min(mylist), max(mylist))) in 3.x. The simpler versions may or may not be significantly faster, but they'll never be slower, and they're a lot easier to read.
… set, use a set. If the program turns out to be slow, and profiling shows you that building or using that set is relevant, then you can look into faster solutions. But if you start off asking the fastest way to do each individual step within your program… well, you should be writing in assembly, not Python.
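The simplified equivalents mentioned above could be sketched like this for Python 3, timed with the standard-library timeit module rather than IPython's %timeit. The input here is a plain list of random integers and the sizes are arbitrary assumptions, so the numbers will not match the timings quoted earlier.

```python
import random
import timeit

def set_loop(mylist):
    # Same result as appending each item of set(mylist) in a for loop
    return list(set(mylist))

def list_loop(mylist):
    # Same result as the original loop; like it, this excludes max(mylist)
    return list(range(min(mylist), max(mylist)))

# Assumed test data: 100,000 integers drawn from 1..1000
ids = [random.randint(1, 1000) for _ in range(100_000)]

set_time = timeit.timeit(lambda: set_loop(ids), number=10)
list_time = timeit.timeit(lambda: list_loop(ids), number=10)
print(f"set_loop:  {set_time:.4f} s for 10 runs")
print(f"list_loop: {list_time:.4f} s for 10 runs")
```

If the goal is only to iterate, the list() call can be dropped as well: looping directly over set(mylist) or range(...) avoids building an intermediate list at all.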