
All these results have been obtained with CPython 3.5.2.

I noticed strange performance for some operations of the set class.

I have measured the time needed to perform the union of two sets containing only integers. This time depends on the sizes of the sets, of course. Surprisingly, it also depends on the “density” of the integers. Here is a plot:

[plot of the time needed to compute a set union]

The x axis is the sum of the sizes of the two sets (which were chosen randomly and independently of each other, for each experiment). The y axis is the time, in seconds (log scale).

A density d means that the sets were instantiated by sampling N integers from a total of N/d integers. In other words, for a density of 0.5, we take one half of the integers of some interval, whereas for a density of 0.1 we take one tenth of the integers of some (larger) interval.

Here is minimal code to reproduce the results (if needed I can post the full code I used for the plot, but it is longer).

import time
import random
import numpy

def get_values(size, density):
    return set(random.sample(range(int(size/density)), size))

def perform_op(size, density):
    values1 = get_values(size, density)
    values2 = get_values(size, density)
    t = time.time()
    result = values1 | values2
    return time.time() - t

size = 10000000
for density in [0.05, 0.1, 0.5, 0.99]:
    times = [perform_op(size, density) for _ in range(10)]
    print('density: %.2f, mean time: %.4f, standard deviation: %.4f'
          % (density, numpy.mean(times), numpy.std(times)))

Union:

density: 0.05, time: 0.9846, standard deviation: 0.0440
density: 0.10, time: 1.0141, standard deviation: 0.0204
density: 0.50, time: 0.5477, standard deviation: 0.0059
density: 0.99, time: 0.3440, standard deviation: 0.0020

There is roughly a factor of 3 between the fastest and the slowest computation times, for sets of the same size. There is also much more variability at low densities.

A funny thing is that, for the intersection (replace values1 | values2 with values1 & values2 in the perform_op function), performance is also non-constant, but the pattern is different:

density: 0.05, time: 0.3928, standard deviation: 0.0046
density: 0.10, time: 0.4876, standard deviation: 0.0041
density: 0.50, time: 0.5975, standard deviation: 0.0127
density: 0.99, time: 0.3806, standard deviation: 0.0015

I did not test other set operations.

I do not understand why there are such differences. From what I know, Python sets are implemented with hash tables, so the density of the integers should not matter as long as their hashes are well spread.
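For reference, the hashes themselves can be inspected directly. In CPython, a small int hashes to its own value, so integer hashes are only as "well spread" as the integers themselves:

```python
# CPython hashes small ints to their own value; -1 is the lone exception,
# since -1 is reserved as an error code at the C level.
print(hash(1234))                                # 1234
print(all(hash(n) == n for n in range(10000)))   # True
print(hash(-1))                                  # -2
```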

What is the origin of these different performances?

    The efficiency of a hash set will depend on the number of elements that hash to the same bucket. That in turn will depend on the size of the set itself and the distribution of the numbers. I'll let someone more familiar with the implementation of set provide a proper answer. Commented Feb 6, 2017 at 22:09

1 Answer


There are two major contributing factors here:

  1. You're producing different size outputs; with dense inputs, the vast majority of the values will overlap, so you'll end up producing much smaller outputs.
  2. int has a very simple hash code; it's just the value of the int. So hash(1234) == 1234. With dense inputs, this means you have mostly contiguous hash codes, with no overlap, because the values are always smaller than the number of set buckets (for example, with 100,000 values, you have 262,144 buckets; when values are dense, your hash codes range from 0 to 101,010 so no actual wraparound occurs modulo 262,144). What's more, the hash codes being largely sequential means that the memory is accessed in a largely sequential pattern (aiding CPU cache fetch heuristics). For sparse inputs, this doesn't apply; you'll have many non-equal values that hash to the same bucket (because each of the 2,000,000 values for the 0.05 case has 7-8 different values that will hash to the same bucket when there are 262,144 buckets). Since Python uses closed hashing (aka open addressing), a bucket collision with non-equal values ends up jumping all over memory (preventing the CPU cache from helping as much) to find a bucket for the new value.
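The first point is easy to verify with a quick sketch (building the sets the same way the question does, seed chosen arbitrarily for repeatability) that measures how the union size depends on density:

```python
import random

def union_size(size, density):
    # Build two sets as in the question and measure their union
    universe = range(int(size / density))
    a = set(random.sample(universe, size))
    b = set(random.sample(universe, size))
    return len(a | b)

random.seed(0)
print(union_size(100000, 0.99))  # near 101,010: the two sets almost coincide
print(union_size(100000, 0.05))  # near 195,000: barely any overlap
```

So at density 0.99 the union is barely larger than either input, while at density 0.05 it is nearly twice as large.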

To demonstrate the bucket collision issue:

>>> import random
>>> vals = random.sample(range(int(100000/0.99)), 100000)
>>> vals_sparse = random.sample(range(int(100000/0.05)), 100000)
>>> # Check the number of unique buckets hashed to for dense and sparse values
>>> len({hash(v) % 262144 for v in vals})
100000    # no bucket overlap at all
>>> len({hash(v) % 262144 for v in vals_sparse})
85002     # ~15% of all values generated produced a bucket collision

Every one of those colliding values has to hop around the set looking for an unoccupied bucket; the dense values don't collide at all, so they avoid this cost completely.
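The probing cost itself can be sketched with a toy open-addressing table (linear probing here, rather than CPython's perturbed probing, but the collision behaviour is analogous):

```python
import random

def count_probes(keys, nbuckets):
    # Toy open-addressing hash table: count how many extra slots we must
    # inspect because a value's home bucket was already occupied.
    table = [None] * nbuckets
    probes = 0
    for k in keys:
        i = hash(k) % nbuckets
        while table[i] is not None:  # collision: jump to the next slot
            probes += 1
            i = (i + 1) % nbuckets
        table[i] = k
    return probes

random.seed(0)
dense = random.sample(range(int(100000 / 0.99)), 100000)
sparse = random.sample(range(int(100000 / 0.05)), 100000)
print(count_probes(dense, 262144))   # 0: every value gets its own bucket
print(count_probes(sparse, 262144))  # tens of thousands of extra probes
```

The dense keys are all smaller than the bucket count, so each lands in a distinct home bucket with zero probing; the sparse keys wrap around modulo the bucket count and collide constantly.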

If you want a test that fixes both issues (while still using dense and sparse inputs), try it with floats that aren't equivalent to int values (float hashing deliberately hashes an int-valued float to the same value as the int). To avoid differing numbers of actually equal values, select the two inputs from non-overlapping ranges, so sparse vs. dense doesn't change the size of the resulting union. This is the code I used; it ends up with fairly uniform times, regardless of density:

import time
import random
import numpy

def get_values(size, density, evens=True):
    # Divide by 100. to get floats with much more varied hashes
    if evens:
        vals = random.sample([x / 100. for x in range(0, int(size/density * 2), 2)], size)
    else:
        vals = random.sample([x / 100. for x in range(1, int(size/density * 2), 2)], size)
    return set(vals)

def perform_op(size, density):
    values1 = get_values(size, density)
    values2 = get_values(size, density, False)  # Select from non-overlapping values
    t = time.time()
    result = values1 | values2
    return time.time() - t, len(result)

size = 100000
for density in [0.05, 0.1, 0.5, 0.99]:
    times = [perform_op(size, density) for _ in range(10)]
    resultlens = [r for _, r in times]
    times = [t for t, _ in times]
    print('density: %.2f, mean time: %.4f, standard deviation: %.4f'
          % (density, numpy.mean(times), numpy.std(times)))
    print(numpy.mean(resultlens))

3 Comments

Thank you for your answer. I am not sure that the first point has a significant impact: computing the union of two sets takes O(len(values1) + len(values2)) time. I tested your code with even values for both sets, and I get similar times for all densities, except 0.99, which takes slightly longer (and not shorter, as expected).
But I think your second point is indeed correct (and greatly explained).
@TomCornebize: Yeah, point #1 is important in some cases, but it's not the main contributor here. It takes slightly longer if you don't account for it because set union presizes the result assuming the inputs are mostly unique (so tons of duplicates don't save on rehashing), meaning that duplicate entries cost you more in terms of equality comparisons, whereas non-duplicates just find an empty bucket and skip equality testing completely. If equality comparison were more expensive (while hashing remained cheap), dense values would take even longer.
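Those extra equality comparisons for duplicates can be made visible with an int subclass that counts them (a hypothetical Counted helper, purely for illustration; hashing stays cheap):

```python
class Counted(int):
    # int subclass that counts equality comparisons
    calls = 0
    def __eq__(self, other):
        Counted.calls += 1
        return int(self) == int(other)
    __hash__ = int.__hash__  # defining __eq__ would otherwise disable hashing

a = {Counted(i) for i in range(1000)}
dup = {Counted(i) for i in range(1000)}         # equal values, distinct objects
disjoint = {Counted(i) for i in range(1000, 2000)}

Counted.calls = 0
_ = a | dup
print(Counted.calls)   # >= 1000: every duplicate must be confirmed equal

Counted.calls = 0
_ = a | disjoint
print(Counted.calls)   # 0: distinct values just land in empty buckets
```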
