
I need to count the number of distinct columns in relatively large arrays.

import numpy as np

def nodistinctcols(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)

X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)])
print "nodistinctcols(X.T)", nodistinctcols(X.T)

The last line takes 20 seconds on my computer, which seems excessively slow. By contrast, X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)]) takes 216 ms. Can nodistinctcols be sped up?

3 Answers


You can use view to change the dtype of M so that an entire row (or column) will be viewed as an array of bytes. Then np.unique can be applied to find the unique values:

import numpy as np

def asvoid(arr):
    """
    View the array as dtype np.void (bytes). This views the last axis of
    ND-arrays as np.void (bytes) so comparisons can be performed on the
    entire row.
    http://stackoverflow.com/a/16840350/190597 (Jaime, 2013-05)

    Some caveats:
    - `asvoid` will work for integer dtypes, but be careful if using asvoid
      on float dtypes, since float zeros may compare UNEQUALLY:
      >>> asvoid([-0.]) == asvoid([0.])
      array([False], dtype=bool)
    - `asvoid` works best on contiguous arrays. If the input is not contiguous,
      `asvoid` will copy the array to make it contiguous, which will slow down
      the performance.
    """
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

def nodistinctcols(M):
    MT = asvoid(M.T)
    uniqs = np.unique(MT)
    return len(uniqs)

X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)])
print("nodistinctcols(X.T) {}".format(nodistinctcols(X.T)))

Benchmark:

In [20]: %timeit nodistinctcols(X.T)
10 loops, best of 3: 63.6 ms per loop

In [21]: %timeit nodistinctcols_orig(X.T)
1 loops, best of 3: 17.4 s per loop

where nodistinctcols_orig is defined by:

def nodistinctcols_orig(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)

Sanity check passes:

In [24]: assert nodistinctcols(X.T) == nodistinctcols_orig(X.T) 

By the way, it might make more sense to define

def num_distinct_rows(M):
    return len(np.unique(asvoid(M)))

and simply pass M.T to the function when you wish to count the number of distinct columns. That way, the function would not be slowed down by an unnecessary transpose if you wish to use it to count the number of distinct rows.
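For illustration, here is a small usage sketch (my own addition, not part of the answer), assuming asvoid, num_distinct_rows and X are defined as above:

num_rows = num_distinct_rows(X)    # distinct rows of X
num_cols = num_distinct_rows(X.T)  # distinct columns of X: transpose first, then count rows
print(num_rows, num_cols)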


2 Comments

That is a truly amazing speed up! The great thing about python is that there is always another way to do anything that is 50 times faster :)
@felix: regarding your other question about circular plots: here is a python-based solution -- not nearly as fancy, but if you are generous, could be called somewhat similar :).

Just for future reference, don't sleep on old-fashioned approaches like using set. Will it be as fast and memory-efficient as a clever numpy approach? No. But often it's good enough for now, which is nothing to sneeze at when you're on the clock.

In [25]: %time slow = nodistinctcols(X.T)
CPU times: user 28.2 s, sys: 12 ms, total: 28.2 s
Wall time: 28.2 s

In [26]: %time medium = len(set(map(tuple, X)))
CPU times: user 324 ms, sys: 0 ns, total: 324 ms
Wall time: 322 ms

In [27]: slow == medium
Out[27]: True

The slow part wasn't the set; it was converting every column to a string with repr.
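For reference, a minimal sketch of the same set-of-tuples idea wrapped as a reusable function (the function name is mine, not from the answer):

import numpy as np

def num_distinct_columns_set(M):
    # Each column becomes a hashable tuple; the set keeps one copy of each.
    return len(set(map(tuple, M.T)))

X = np.array([np.random.randint(2, size=16) for i in range(2**16)])
print(num_distinct_columns_set(X.T))  # same count as len(set(map(tuple, X)))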

1 Comment

Now I want to accept two answers! This is great. Thanks.

If you have fewer rows than columns, you can also do multiple stable sorts along the rows and then count the uniques:

def count(x):
    x = x.copy()
    x = x[x[:, 0].argsort()]  # first sort can be unstable
    for i in range(1, x.shape[1]):
        x = x[x[:, i].argsort(kind='mergesort')]  # stable sorts from now on
    # x is now sorted so that equal rows are next to each other
    # -> compare neighbors and subtract the number of duplicate rows
    return x.shape[0] - np.count_nonzero((x[1:, :] == x[:-1, :]).all(axis=1))

With numpy 1.9.dev it is faster than the void comparison; with older numpy versions the fancy indexing kills the performance (about 4 times slower than the void approach):

X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)])

In [6]: %timeit count(X)
10 loops, best of 3: 144 ms per loop

Xt = X.T.copy()

In [8]: %timeit unutbu_void(Xt)
10 loops, best of 3: 161 ms per loop
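As a quick sanity check (my own addition, assuming count from above is in scope), the sort-based result can be compared against the simple set-of-tuples count:

import numpy as np

X = np.random.randint(2, size=(2**16, 16))
assert count(X) == len(set(map(tuple, X)))  # both count the distinct rows of X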

Comments
