4

I have a list (let's call it all_my_arrays) that contains about 48,000 1D arrays. I'd like to know how many duplicate arrays are in this list, if any. However, I'd like to exclude empty arrays (because I have multiple empty arrays within the list and don't want those factored into my duplicate count). I tried this code below, but it is taking too long:

import numpy as np uniques=[] for arr in all_my_arrays: if not np.array_equal(np.array([]), arr): if not any(np.array_equal(arr, unique_arr) for unique_arr in uniques): uniques.append(arr) print(len(uniques)) #number of non-duplicates 

Is there a much quicker way to accomplish this?

7
  • Are these arrays all the same size when they are non-empty? Commented Sep 5, 2018 at 23:57
  • Maybe you could sort all_my_arrays or a shallow copy of it. Then you only have to compare each array with the previous one to eliminate duplicates. Commented Sep 5, 2018 at 23:57
  • Yes, the non-empty 1D arrays are all the same size (length of 7). Commented Sep 5, 2018 at 23:59
  • 2
    I have a list... A Python list of numpy arrays? What comprises a duplicate? Is array([1, 2, 3]) the same as array([3, 2, 1])? Commented Sep 6, 2018 at 0:34
  • @dawg 1st question: Yes, see the tags. 2nd question: No, duplicate arrays are those which return True to np.array_equal(array1, array2), which requires them to be the same length and that every element array1[i] and array2[i] is the same for each i within the range of length. Commented Sep 6, 2018 at 0:42

4 Answers 4

2

You can use the set type to get the unique values in your list. First you have to convert the arrays to hashable types (tuple here is good). Here's an example:

uniques = set(tuple(arr) for arr in all_my_arrays if arr.size > 0) 

The set uniques will contain all the unique, non-empty arrays from your original all_my_arrays list. The contents of uniques are tuples, but you can convert them back to arrays with a list comprehension. If you're only interested in the number of unique arrays, then you can just call len(uniques) and not worry about converting back to arrays.

This approach has time complexity O(n + m) where n is the number of arrays and m is the length of each. There is however the overhead of converting to tuples, but I believe this method should be faster than what you have so far (which has time complexity O(n^2)) especially for such a large number of arrays.

Edit: To make this a bit faster, you can remove the empty check on each element and then just handle that at the end. Here's what that would look like:

uniques = set(tuple(arr) for arr in all_my_arrays) num_unique = len(uniques) if () not in uniques else len(uniques) - 1 
Sign up to request clarification or add additional context in comments.

4 Comments

Thanks, this ran very fast! Does this method preserve order (not required for the purposes of my question, just wondering)?
@curious_cosmo a set is an unordered collection, so it does not preserve order. You could check out collections.OrderedDict (putting tuples as the keys) or this implementation of OrderedSet if you're interested in preserving order.
@Alexander the example you've just pointed out (set(np.matrix([1,2,1,2]))) is quite different from the solution I have posted. Yours converts the matrix [1 2 1 2] into a set {1, 2}, while mine converts a comprehension of tuples into a set. So yours is one level beneath what I have described in my answer. A more appropriate example would be set([np.matrix([1,2,1,2]), np.matrix([1,2,1,3])]), which produces {(1,2,1,2), (1,2,1,3)}, which is actually in line with the question.
You are correct. Sorry for the confusion and thanks for the detailed explanation.
1

Just do

in_arr = np.array([i for i in all_my_arrays if i.size == 7]) uniques = np.unique(in_arr, axis = 0) uniques_list = list(uniques) # if you really want a list 

EDIT: Beware that np.unique sorts internally, so order is not preserved. If you want to maintain order you'll probably need to build a speciality numba function.

Comments

0

Here's a NumPy-based solution which may work for you. For simplicity, I'll use an example where arrays are either of length 3 or 0.

Note that count of duplicates as per your definition equals total number of arrays minus number of empty arrays minus number of unique non-empty arrays.

The last part may be most expensive, but we can use np.unique on a regular NumPy array for this:

L = [np.array([1, 2, 3]), np.array([]), np.array([3, 2, 1]), np.array([1, 3, 2]), np.array([1, 2, 3]), np.array([1, 3, 2]), np.array([]), np.array([]), np.array([1, 2, 3])] empty_arr = {idx for idx, arr in enumerate(L) if arr.shape == (0,)} empty_count = len(empty_arr) A = np.array([arr for idx, arr in enumerate(L) if idx not in empty_arr]) unique_arr = np.unique(A, axis=0) duplicates = len(L) - len(empty_arr) - len(unique_arr) print(duplicates) # 3 

Comments

0

I am not sure that I am understanding your problem fully.

However, if I understand:

  1. You have a Python list of 7 element numpy arrays;
  2. You want to filter out the arrays that are all 0;
  3. You then want to get a unique count of what is left.

If that is correct, you can do this:

import numpy as np li_of_arr=[np.random.choice(4,7) for _ in range(50000)] # example list of arrays mat=np.array([e for e in li_of_arr if np.any(e)]) # create a numpy matrix of non zero arrays uniq=np.unique(mat,axis=0) # here are your unique sub arrays print (len(mat), len(uniq)) # len of non zero elements and unique non zero elements 

This takes less than 1 second on my MacBook.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.