
I have some data that looks something like this:

ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6
...

where each row is a group.

My goal is to build a dictionary mapping each ID to the set of other IDs that share at least one group with it.

For example, this data would return {ID1: [ID2, ID3, ID4, ID5], ID2:[ID1, ID3] ... }

I can think of 3 options for this, and I'm wondering which is (generally) best:

  1. Check whether an ID is already in the list before adding it
  2. Create sets instead of lists, and add each ID to the set
  3. Add all IDs to the list, then convert all of the lists to sets at the end.
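For concreteness, here is a sketch of what the three options might look like (assuming the rows have already been split into lists of IDs; the variable names are illustrative only):

```python
import collections

rows = [['ID1', 'ID2', 'ID3'], ['ID1', 'ID4', 'ID5'], ['ID3', 'ID5', 'ID7', 'ID6']]

# Option 1: lists, with a membership check before each append
opt1 = collections.defaultdict(list)
for row in rows:
    for a in row:
        for b in row:
            if b != a and b not in opt1[a]:
                opt1[a].append(b)

# Option 2: sets from the start
opt2 = collections.defaultdict(set)
for row in rows:
    for a in row:
        for b in row:
            if b != a:
                opt2[a].add(b)

# Option 3: lists first, converted to sets at the end
tmp = collections.defaultdict(list)
for row in rows:
    for a in row:
        for b in row:
            if b != a:
                tmp[a].append(b)
opt3 = {k: set(v) for k, v in tmp.items()}

# All three produce the same mapping
assert {k: set(v) for k, v in opt1.items()} == dict(opt2) == opt3
```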
1 Comment

  • Best in what sense? Speed? There's always the timeit module... Commented Sep 15, 2013 at 0:23

5 Answers


TL;DR: Go with option 2. Just use sets from the start.

In Python, sets are hash-sets, and lists are dynamic arrays. Inserting is O(1) for both, but checking if an element exists is O(n) for the list and O(1) for the set.

So option 1 is immediately out. If you are inserting n items and need to check the list every time, then the overall complexity becomes O(n^2).

Options 2 and 3 are both optimal at O(n) overall. Option 2 might be faster in micro-benchmarks because you don't need to move objects between collections. In practice, choose the option that is easier to read and maintain in your specific circumstance.
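The membership-cost difference behind this argument is easy to see directly with the timeit module (a small sketch, not part of the original answer):

```python
import timeit

n = 100_000
lst = list(range(n))
st = set(lst)

# Membership test for an element near the end:
# the list scans linearly, the set does a hash lookup.
list_time = timeit.timeit(lambda: (n - 1) in lst, number=100)
set_time = timeit.timeit(lambda: (n - 1) in st, number=100)
print(f'list: {list_time:.4f}s  set: {set_time:.6f}s')
```

The gap widens as n grows, which is exactly why the per-append check in option 1 turns the whole build quadratic.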


7 Comments

Sets and dictionaries are hashtables.
Ok, that makes a lot of sense. Then adding to a set takes constant time, so making the whole set should take linear time. Options 2 and 3 still have the same complexity though, O(n).
@cbarrick What do you make of the answers below that benchmarked it and found option 3 the winner?
Flying and driving are also both O(distance), that hardly makes them equivalent ways to get somewhere. When comparing algorithms with the same big-O, you need to look at the details that are filtered out in that analysis.
@Barmar. Yep. I wrote this years ago and was probably too inexperienced to be posting answers. This is definitely due for a rewrite. Option 1 is still quadratic time, so it should be avoided, period. It'll only be faster for really small sets, and at that point it doesn't matter. Options 2 and 3 are actually pretty much equivalent. Any difference will be case by case, so we can't really make a general answer.

Option 2 sounds the most logical to me, especially with a defaultdict it should be fairly easy to do :)

```python
import pprint
import collections

data = '''ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6'''

groups = collections.defaultdict(set)
for row in data.split('\n'):
    cols = row.split()
    for groupcol in cols:
        for col in cols:
            if col is not groupcol:
                groups[groupcol].add(col)

pprint.pprint(dict(groups))
```

Results:

```
{'ID1': set(['ID2', 'ID3', 'ID4', 'ID5']),
 'ID2': set(['ID1', 'ID3']),
 'ID3': set(['ID1', 'ID2', 'ID5', 'ID6', 'ID7']),
 'ID4': set(['ID1', 'ID5']),
 'ID5': set(['ID1', 'ID3', 'ID4', 'ID6', 'ID7']),
 'ID6': set(['ID3', 'ID5', 'ID7']),
 'ID7': set(['ID3', 'ID5', 'ID6'])}
```

4 Comments

You do not need to test before adding an element to the set. A set won't complain if you try to add an element that is already in the set.
I'm not testing if it's in the set, I'm testing if it's not the groupcol since that has to be excluded from the set.
if col is not groupcol: groups[groupcol].add(col)
Exactly :) Re-read the code and you'll see I'm not doing what you think I'm doing ;)

Going off of keredson's answer above, updating and rerunning it under Python 3, and adding a 4th option using a list comprehension:

```python
import time

class Timer():
    def __init__(self, desc):
        self.desc = desc
    def __enter__(self):
        self.start = time.time()
    def __exit__(self, type, value, traceback):
        self.finish = time.time()
        print(self.desc, 'took', self.finish - self.start)

data = list(range(4000000))
data = data + data
print(f'{len(data):,} items to add')

with Timer('option 1'):
    myset = set()
    for x in data:
        if (x not in myset):
            myset.add(x)
print(f'{len(myset):,} items to added')

with Timer('option 2'):
    myset = set()
    for x in data:
        myset.add(x)
print(f'{len(myset):,} items to added')

with Timer('option 3'):
    mylist = list()
    for x in data:
        mylist.append(x)
    myset = set(mylist)
print(f'{len(myset):,} items to added')

with Timer('option 4'):
    myset = set([ x for x in data ])
print(f'{len(myset):,} items to added')
```

where the output was:

```
8,000,000 items to add
option 1 took 0.771376371383667
4,000,000 items to added
option 2 took 0.6729307174682617
4,000,000 items to added
option 3 took 0.6674449443817139
4,000,000 items to added
option 4 took 0.3772563934326172
4,000,000 items to added
```

and you'll see the list-comprehension method runs almost twice as fast as the other three.
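Worth noting (not part of the original answer): the comprehension in option 4 still builds a throwaway intermediate list, so passing the iterable straight to the set() constructor should be faster still, since the whole loop then runs in C. A quick sketch of that variant:

```python
import time

data = list(range(4000000))
data = data + data

start = time.time()
myset = set(data)  # option 5: no comprehension, no intermediate list
print(f'{len(myset):,} items, set(data) took', time.time() - start)
```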



I agree with the previous analysis that option 2 is best, but a micro-benchmark is often illuminating in these situations:

```python
import time

class Timer(object):
    def __init__(self, desc):
        self.desc = desc
    def __enter__(self):
        self.start = time.time()
    def __exit__(self, type, value, traceback):
        self.finish = time.time()
        print self.desc, 'took', self.finish - self.start

data = list(range(4000000))

with Timer('option 2'):
    myset = set()
    for x in data:
        myset.add(x)

with Timer('option 3'):
    mylist = list()
    for x in data:
        mylist.append(x)
    myset = set(mylist)
```

the results were surprising to me:

```
$ python test.py
option 2 took 0.652163028717
option 3 took 0.748883008957
```

I would have expected at least a 2x speed difference.



So, I timed a few different options, and after a few iterations, came up with the following strategies. I thought that sets2 would be the winner, but listToSet2 was faster for every single type of group.

All of the functions except for listFilter were in the same ballpark - listFilter was much slower.

```python
import random
import collections

small = [[random.randint(1, 25) for _ in range(5)] for i in range(100)]
medium = [[random.randint(1, 250) for _ in range(5)] for i in range(1000)]
mediumLotsReps = [[random.randint(1, 25) for _ in range(5)] for i in range(1000)]
bigGroups = [[random.randint(1, 250) for _ in range(75)] for i in range(100)]
huge = [[random.randint(1, 2500) for _ in range(5)] for i in range(10000)]

def sets(groups):
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            for j in group:
                # != rather than 'is not': identity checks are unreliable
                # for ints outside CPython's small-int cache
                if i != j:
                    results[i].add(j)
    return results

def listToSet(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i, j in enumerate(group):
            # skip position i itself so an ID doesn't end up in its own set
            results[j] += group[:i] + group[i+1:]
    return {k: set(v) for k, v in results.iteritems()}

def listToSet2(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i, j in enumerate(group):
            results[j] += group
    return {k: set(v) - set([k]) for k, v in results.iteritems()}

def sets2(groups):
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            results[i] |= set(group)
    return {k: v - set([k]) for k, v in results.iteritems()}

def listFilter(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i, j in enumerate(group):
            filteredGroup = group[:i] + group[i+1:]
            results[j] += ([k for k in filteredGroup if k not in results[j]])
    return results
```
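To reproduce the comparison on Python 3, here is a self-contained harness timing the two leading strategies (a sketch, with items() replacing the Python 2 iteritems() used above; the seed and repeat count are arbitrary choices):

```python
import collections
import random
import timeit

random.seed(0)
groups = [[random.randint(1, 250) for _ in range(5)] for _ in range(1000)]

def sets(groups):
    # option 2: add into sets from the start
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            for j in group:
                if i != j:
                    results[i].add(j)
    return dict(results)

def listToSet2(groups):
    # append whole groups to lists, dedupe and drop self at the end
    results = collections.defaultdict(list)
    for group in groups:
        for j in group:
            results[j] += group
    return {k: set(v) - {k} for k, v in results.items()}

# both strategies must agree before their timings mean anything
assert sets(groups) == listToSet2(groups)

print('sets      ', timeit.timeit(lambda: sets(groups), number=20))
print('listToSet2', timeit.timeit(lambda: listToSet2(groups), number=20))
```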

