
I have some data that looks something like this:

ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6
...

where each row is a group.

My goal is to build a dictionary mapping each ID to the set of other IDs that share at least one group with it.

For example, this data would return {ID1: [ID2, ID3, ID4, ID5], ID2:[ID1, ID3] ... }

I can think of 3 options for this, and I'm wondering which is (generally) best:

  1. Check whether an ID is already in the list before adding it
  2. Create sets instead of lists, and add each ID to the set
  3. Add all IDs to the list, then convert all of the lists to sets at the end.
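For concreteness, here is a sketch of what the three options might look like (assuming the rows have already been split into lists of IDs; the variable names are illustrative only):

```python
import collections

rows = [['ID1', 'ID2', 'ID3'], ['ID1', 'ID4', 'ID5'], ['ID3', 'ID5', 'ID7', 'ID6']]

# Option 1: lists, with a membership check before each append
opt1 = collections.defaultdict(list)
for row in rows:
    for a in row:
        for b in row:
            if b != a and b not in opt1[a]:
                opt1[a].append(b)

# Option 2: sets from the start
opt2 = collections.defaultdict(set)
for row in rows:
    for a in row:
        for b in row:
            if b != a:
                opt2[a].add(b)

# Option 3: lists first, converted to sets at the end
tmp = collections.defaultdict(list)
for row in rows:
    for a in row:
        for b in row:
            if b != a:
                tmp[a].append(b)
opt3 = {k: set(v) for k, v in tmp.items()}

# All three produce the same mapping
assert {k: set(v) for k, v in opt1.items()} == dict(opt2) == opt3
```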
1 Comment

  • Best in what sense? Speed? There's always the timeit module... Commented Sep 15, 2013 at 0:23

5 Answers


TL;DR: Go with option 2. Just use sets from the start.

In Python, sets are hash-sets, and lists are dynamic arrays. Inserting is O(1) for both, but checking if an element exists is O(n) for the list and O(1) for the set.

So option 1 is immediately out. If you are inserting n items and need to check the list every time, then the overall complexity becomes O(n^2).

Options 2 and 3 are both optimal at O(n) overall. Option 2 might be faster in micro-benchmarks because you don't need to move objects between collections. In practice, choose the option that is easier to read and maintain in your specific circumstance.
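The membership-cost difference behind this argument is easy to see directly with the timeit module (a small sketch, not part of the original answer):

```python
import timeit

n = 100_000
lst = list(range(n))
st = set(lst)

# Membership test for an element near the end:
# the list scans linearly, the set does a hash lookup.
list_time = timeit.timeit(lambda: (n - 1) in lst, number=100)
set_time = timeit.timeit(lambda: (n - 1) in st, number=100)
print(f'list: {list_time:.4f}s  set: {set_time:.6f}s')
```

The gap widens as n grows, which is exactly why the per-append check in option 1 turns the whole build quadratic.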


7 Comments

Sets and dictionaries are hashtables.
Ok, that makes a lot of sense. Then adding to a set takes constant time, so making the whole set should take linear time. Options 2 and 3 still have the same complexity though, O(n).
@cbarrick What do you make of the answers below that benchmarked it and found option 3 the winner?
Flying and driving are also both O(distance), that hardly makes them equivalent ways to get somewhere. When comparing algorithms with the same big-O, you need to look at the details that are filtered out in that analysis.
@Barmar. Yep. I wrote this years ago and was probably too inexperienced to be posting answers. This is definitely due for a rewrite. Option 1 is still quadratic time, so it should be avoided, period. It'll only be faster for really small sets, and at that point it doesn't matter. Options 2 and 3 are actually pretty much equivalent. Any difference will be case by case, so we can't really make a general answer.

Option 2 sounds the most logical to me, especially with a defaultdict it should be fairly easy to do :)

```python
import pprint
import collections

data = '''ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6'''

groups = collections.defaultdict(set)
for row in data.split('\n'):
    cols = row.split()
    for groupcol in cols:
        for col in cols:
            if col is not groupcol:
                groups[groupcol].add(col)

pprint.pprint(dict(groups))
```

Results:

```
{'ID1': set(['ID2', 'ID3', 'ID4', 'ID5']),
 'ID2': set(['ID1', 'ID3']),
 'ID3': set(['ID1', 'ID2', 'ID5', 'ID6', 'ID7']),
 'ID4': set(['ID1', 'ID5']),
 'ID5': set(['ID1', 'ID3', 'ID4', 'ID6', 'ID7']),
 'ID6': set(['ID3', 'ID5', 'ID7']),
 'ID7': set(['ID3', 'ID5', 'ID6'])}
```

4 Comments

You do not need to test before adding an element to the set. A set won't complain if you try to add an element that is already in the set.
I'm not testing if it's in the set, I'm testing if it's not the groupcol since that has to be excluded from the set.
if col is not groupcol: groups[groupcol].add(col)
Exactly :) Re-read the code and you'll see I'm not doing what you think I'm doing ;)

Going off of keredson's answer above, updating and rerunning it under Python 3, and adding a 4th option using a list comprehension:

```python
import time

class Timer():
    def __init__(self, desc):
        self.desc = desc
    def __enter__(self):
        self.start = time.time()
    def __exit__(self, type, value, traceback):
        self.finish = time.time()
        print(self.desc, 'took', self.finish - self.start)

data = list(range(4000000))
data = data + data
print(f'{len(data):,} items to add')

with Timer('option 1'):
    myset = set()
    for x in data:
        if (x not in myset):
            myset.add(x)
print(f'{len(myset):,} items to added')

with Timer('option 2'):
    myset = set()
    for x in data:
        myset.add(x)
print(f'{len(myset):,} items to added')

with Timer('option 3'):
    mylist = list()
    for x in data:
        mylist.append(x)
    myset = set(mylist)
print(f'{len(myset):,} items to added')

with Timer('option 4'):
    myset = set([ x for x in data ])
print(f'{len(myset):,} items to added')
```

where the output was:

```
8,000,000 items to add
option 1 took 0.771376371383667
4,000,000 items to added
option 2 took 0.6729307174682617
4,000,000 items to added
option 3 took 0.6674449443817139
4,000,000 items to added
option 4 took 0.3772563934326172
4,000,000 items to added
```

and you'll see the list-comprehension method runs almost twice as fast as the other three.
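Worth noting (not part of the original answer): the comprehension in option 4 still builds a throwaway intermediate list, so passing the iterable straight to the set() constructor should be faster still, since the whole loop then runs in C. A quick sketch of that variant:

```python
import time

data = list(range(4000000))
data = data + data

start = time.time()
myset = set(data)  # option 5: no comprehension, no intermediate list
print(f'{len(myset):,} items, set(data) took', time.time() - start)
```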



I agree with the previous analysis that option 2 is best, but a micro-benchmark is often illuminating in these situations:

```python
import time

class Timer(object):
    def __init__(self, desc):
        self.desc = desc
    def __enter__(self):
        self.start = time.time()
    def __exit__(self, type, value, traceback):
        self.finish = time.time()
        print self.desc, 'took', self.finish - self.start

data = list(range(4000000))

with Timer('option 2'):
    myset = set()
    for x in data:
        myset.add(x)

with Timer('option 3'):
    mylist = list()
    for x in data:
        mylist.append(x)
    myset = set(mylist)
```

the results were surprising to me:

```
$ python test.py
option 2 took 0.652163028717
option 3 took 0.748883008957
```

I would have expected at least a 2x speed difference.



So, I timed a few different options, and after a few iterations, came up with the following strategies. I thought that sets2 would be the winner, but listToSet2 was faster for every single type of group.

All of the functions except for listFilter were in the same ballpark - listFilter was much slower.

```python
import random
import collections

small = [[random.randint(1, 25) for _ in range(5)] for i in range(100)]
medium = [[random.randint(1, 250) for _ in range(5)] for i in range(1000)]
mediumLotsReps = [[random.randint(1, 25) for _ in range(5)] for i in range(1000)]
bigGroups = [[random.randint(1, 250) for _ in range(75)] for i in range(100)]
huge = [[random.randint(1, 2500) for _ in range(5)] for i in range(10000)]

def sets(groups):
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            for j in group:
                # != rather than 'is not': identity checks are unreliable
                # for ints outside CPython's small-int cache
                if i != j:
                    results[i].add(j)
    return results

def listToSet(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i, j in enumerate(group):
            # skip position i itself so an ID doesn't end up in its own set
            results[j] += group[:i] + group[i+1:]
    return {k: set(v) for k, v in results.iteritems()}

def listToSet2(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i, j in enumerate(group):
            results[j] += group
    return {k: set(v) - set([k]) for k, v in results.iteritems()}

def sets2(groups):
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            results[i] |= set(group)
    return {k: v - set([k]) for k, v in results.iteritems()}

def listFilter(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i, j in enumerate(group):
            filteredGroup = group[:i] + group[i+1:]
            results[j] += ([k for k in filteredGroup if k not in results[j]])
    return results
```
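To reproduce the comparison on Python 3, here is a self-contained harness timing the two leading strategies (a sketch, with items() replacing the Python 2 iteritems() used above; the seed and repeat count are arbitrary choices):

```python
import collections
import random
import timeit

random.seed(0)
groups = [[random.randint(1, 250) for _ in range(5)] for _ in range(1000)]

def sets(groups):
    # option 2: add into sets from the start
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            for j in group:
                if i != j:
                    results[i].add(j)
    return dict(results)

def listToSet2(groups):
    # append whole groups to lists, dedupe and drop self at the end
    results = collections.defaultdict(list)
    for group in groups:
        for j in group:
            results[j] += group
    return {k: set(v) - {k} for k, v in results.items()}

# both strategies must agree before their timings mean anything
assert sets(groups) == listToSet2(groups)

print('sets      ', timeit.timeit(lambda: sets(groups), number=20))
print('listToSet2', timeit.timeit(lambda: listToSet2(groups), number=20))
```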

