Quick way to check for duplicate arrays within list

Question

I have a list (let's call it all_my_arrays) that contains about 48,000 1D arrays. I'd like to know how many duplicate arrays are in this list, if any. However, I'd like to exclude empty arrays (because I have multiple empty arrays within the list and don't want those factored into my duplicate count). I tried this code below, but it is taking too long:

import numpy as np uniques=[] for arr in all_my_arrays: if not np.array_equal(np.array([]), arr): if not any(np.array_equal(arr, unique_arr) for unique_arr in uniques): uniques.append(arr) print(len(uniques)) #number of non-duplicates

Is there a much quicker way to accomplish this?

Maybe you could sort all_my_arrays or a shallow copy of it. Then you only have to compare each array with the previous one to eliminate duplicates. — Michael Butscher
– Michael Butscher, Commented Sep 5, 2018 at 23:57
Yes, the non-empty 1D arrays are all the same size (length of 7). — curious_cosmo
– curious_cosmo, Commented Sep 5, 2018 at 23:59
I have a list... A Python list of numpy arrays? What comprises a duplicate? Is array([1, 2, 3]) the same as array([3, 2, 1])? — dawg
– dawg, Commented Sep 6, 2018 at 0:34
@dawg 1st question: Yes, see the tags. 2nd question: No, duplicate arrays are those which return True to np.array_equal(array1, array2), which requires them to be the same length and that every element array1[i] and array2[i] is the same for each i within the range of length. — curious_cosmo
– curious_cosmo, Commented Sep 6, 2018 at 0:42

Henry · Accepted Answer · 2018-09-06 00:26:44Z

You can use the set type to get the unique values in your list. First you have to convert the arrays to hashable types (tuple here is good). Here's an example:

uniques = set(tuple(arr) for arr in all_my_arrays if arr.size > 0)

The set uniques will contain all the unique, non-empty arrays from your original all_my_arrays list. The contents of uniques are tuples, but you can convert them back to arrays with a list comprehension. If you're only interested in the number of unique arrays, then you can just call len(uniques) and not worry about converting back to arrays.

This approach has time complexity O(n + m) where n is the number of arrays and m is the length of each. There is however the overhead of converting to tuples, but I believe this method should be faster than what you have so far (which has time complexity O(n^2)) especially for such a large number of arrays.

Edit: To make this a bit faster, you can remove the empty check on each element and then just handle that at the end. Here's what that would look like:

uniques = set(tuple(arr) for arr in all_my_arrays) num_unique = len(uniques) if () not in uniques else len(uniques) - 1

Thanks, this ran very fast! Does this method preserve order (not required for the purposes of my question, just wondering)?
@curious_cosmo a set is an unordered collection, so it does not preserve order. You could check out collections.OrderedDict (putting tuples as the keys) or this implementation of OrderedSet if you're interested in preserving order.
@Alexander the example you've just pointed out (set(np.matrix([1,2,1,2]))) is quite different from the solution I have posted. Yours converts the matrix [1 2 1 2] into a set {1, 2}, while mine converts a comprehension of tuples into a set. So yours is one level beneath what I have described in my answer. A more appropriate example would be set([np.matrix([1,2,1,2]), np.matrix([1,2,1,3])]), which produces {(1,2,1,2), (1,2,1,3)}, which is actually in line with the question.
You are correct. Sorry for the confusion and thanks for the detailed explanation.

Daniel F · Accepted Answer · 2018-09-06 07:11:03Z

Just do

in_arr = np.array([i for i in all_my_arrays if i.size == 7]) uniques = np.unique(in_arr, axis = 0) uniques_list = list(uniques) # if you really want a list

EDIT: Beware that np.unique sorts internally, so order is not preserved. If you want to maintain order you'll probably need to build a speciality numba function.

jpp · Accepted Answer · 2018-09-06 00:18:28Z

Here's a NumPy-based solution which may work for you. For simplicity, I'll use an example where arrays are either of length 3 or 0.

Note that count of duplicates as per your definition equals total number of arrays minus number of empty arrays minus number of unique non-empty arrays.

The last part may be most expensive, but we can use np.unique on a regular NumPy array for this:

L = [np.array([1, 2, 3]), np.array([]), np.array([3, 2, 1]), np.array([1, 3, 2]), np.array([1, 2, 3]), np.array([1, 3, 2]), np.array([]), np.array([]), np.array([1, 2, 3])] empty_arr = {idx for idx, arr in enumerate(L) if arr.shape == (0,)} empty_count = len(empty_arr) A = np.array([arr for idx, arr in enumerate(L) if idx not in empty_arr]) unique_arr = np.unique(A, axis=0) duplicates = len(L) - len(empty_arr) - len(unique_arr) print(duplicates) # 3

dawg · Accepted Answer · 2018-09-06 01:15:13Z

I am not sure that I am understanding your problem fully.

However, if I understand:

You have a Python list of 7 element numpy arrays;
You want to filter out the arrays that are all 0;
You then want to get a unique count of what is left.

If that is correct, you can do this:

import numpy as np li_of_arr=[np.random.choice(4,7) for _ in range(50000)] # example list of arrays mat=np.array([e for e in li_of_arr if np.any(e)]) # create a numpy matrix of non zero arrays uniq=np.unique(mat,axis=0) # here are your unique sub arrays print (len(mat), len(uniq)) # len of non zero elements and unique non zero elements

This takes less than 1 second on my MacBook.

Collectives™ on Stack Overflow

Quick way to check for duplicate arrays within list

4 Answers 4

4 Comments

Comments

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

4 Comments

Comments

Comments

Comments

Linked

Related