set issubset performance difference depending on the argument type

Question

why this question ?

I was trying to answer this question: Check if All Values Exist as Keys in Dictionary with something better than a generator comprehension fed to all (python loops, even in comprehensions, slow down execution compared to implicit loops that some functions perform):

all(i in bar for i in foo)

where bar is a dictionary and foo is a list by using set.issubset (with a conversion to set of foo to be able to use foo.issubset(bar)), and didn't succeed to beat the times of the all solution (unless both containers are converted to sets).

my question:

From the documentation of set:

Note, the non-operator versions of union(), intersection(), difference(), and symmetric_difference(), issubset(), and issuperset() methods will accept any iterable as an argument. In contrast, their operator based counterparts require their arguments to be sets. This precludes error-prone constructions like set('abc') & 'cbs' in favor of the more readable set('abc').intersection('cbs').

Okay but the performance really depends on the type of argument, even if the complexity does not (The complextiy of Python issubset()):

import timeit foo = {i for i in range(1, 10000, 2)} bar = foo - {400} n=10000 x = timeit.timeit(setup="foo = {str(i) for i in range(1, 10000, 2)};bar = foo - {'400'}",stmt="bar.issubset(foo)",number=n) print("issubset(set)",x) x = timeit.timeit(setup="foo = {str(i) for i in range(1, 10000, 2)};bar = foo - {'400'};foo=list(foo)",stmt="bar.issubset(foo)",number=n) print("issubset(list)",x) x = timeit.timeit(setup="foo = {str(i):i for i in range(1, 10000, 2)};bar = set(foo) - {'400'}",stmt="bar.issubset(foo)",number=n) print("issubset(dict)",x) x = timeit.timeit(setup="foo = {str(i):i for i in range(1, 10000, 2)}.keys();bar = set(foo) - {'400'}",stmt="bar.issubset(foo)",number=n) print("issubset(dict_keys)",x)

my results (Python 3.4):

issubset(set) 1.6141405847648826 issubset(list) 3.698748032058883 issubset(dict) 3.6300025109004244 issubset(dict_keys) 4.224299651223102

So if a set is passed as the argument, the result is very fast.

Using a list is much slower. I figured out that it was because of the hash that must be done on the strings is costly. So I changed my test inputs with integers like this:

foo = {i for i in range(1, 10000, 2)} bar = foo - {400}

and the results were globally faster but still a huge time difference:

issubset(set) 0.5981848205989139 issubset(list) 1.7991591232742143 issubset(dict) 1.889119736960271 issubset(dict_keys) 2.2531574114632678

I also tried to change dict by dict.keys() as in python 3 the keys is said to be (https://www.python.org/dev/peps/pep-3106/) "a set-like or unordered container object".

But in that case, the result is even worse than with dict or list.

So why does passing a set beats passing a list or a dict or a dict_keys object? I don't see anything mentionned in the documentation about this.

user2357112 · Accepted Answer · 2018-06-06 20:30:25Z

The set.issubset algorithm requires a set to work with (frozensets and subclasses count); if you pass it something else, it will make a set. It's basically all(elem in other for elem in self), and it needs to know that elem in other is efficient and means what it means for sets. The only way it knows how to guarantee that is to ensure other is a set. Making a set is expensive.

(I've glossed over some details. If you want to know exactly what's going on, particularly if you have a weird set subclass, read the source code in the link.)

I expected it probably could be made to work with dict key views (which are sufficiently set-like) but nobody bothered to implement that in CPython yet.
Seems like PyAnySet_Check does allow for frozenset and set subclasses.
@Jean-FrançoisFabre: As wim said, frozensets count as sets for the purpose of this check. (The same exact routine is used for set.issubset and frozenset.issubset, and it accepts the same argument types regardless of the type of self.)

Collectives™ on Stack Overflow

set issubset performance difference depending on the argument type

1 Answer 1

3 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Linked

Related