
I have two matrices of features with different numbers of rows. Suppose matrix A has more rows than matrix B. The columns of the matrices are ID1, ID2, Time_slice, and feature value. Since some Time_slice values have no feature value in B, B has fewer rows than A. I need to find which rows are missing from B, then add those rows to B with the corresponding ID1 and ID2 values and zero for the feature.

The columns are ID1, ID2, Time_slice, Feature:

    A = array([[ 100, 1., 0., 1.5 ],
               [ 100, 1., 1., 3.7 ],
               [ 100, 2., 0., 1.2 ],
               [ 100, 2., 1., 1.8 ],
               [ 100, 2., 2., 2.9 ],
               [ 101, 3., 0., 1.5 ],
               [ 101, 3., 1., 3.7 ],
               [ 101, 4., 0., 1.2 ],
               [ 101, 4., 1., 1.8 ],
               [ 101, 4., 2., 2.9 ]])

    B = array([[ 100, 1., 0., 1.25],
               [ 100, 1., 1., 3.37],
               [ 100, 2., 0., 1.42],
               [ 100, 2., 1., 1.68]])

The output should be as follows:

    [[ 100, 1., 0., 1.25],
     [ 100, 1., 1., 3.37],
     [ 100, 2., 0., 1.42],
     [ 100, 2., 1., 1.68],
     [ 100, 2., 2., 0   ],
     [ 101, 3., 0., 0   ],
     [ 101, 3., 1., 0   ],
     [ 101, 4., 0., 0   ],
     [ 101, 4., 1., 0   ],
     [ 101, 4., 2., 0   ]]

2 Answers


It appears (from the desired output) that a row in A is thought to match a row in B if the first three columns are equal. Your problem would be in large part solved if we could identify which rows of A match rows of B.

If identifying matches simply depended on values from a single column, then we could use np.in1d. For example, if [0, 1, 2, 5, 0] were values in A and [0, 2] were values in B, then

    In [39]: np.in1d([0, 1, 2, 5, 0], [0, 2])
    Out[39]: array([ True, False,  True, False,  True], dtype=bool)

shows which rows of A match rows of B.

There is (currently) no higher-dimensional generalization of this function in NumPy.
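Row-wise membership can of course be spelled out by hand with broadcasting. The sketch below (with made-up toy arrays, not the data from the question) shows the idea, but it allocates a len(a) × len(b) × ncols intermediate, so it is only practical for small inputs:

    import numpy as np

    # Sketch only: brute-force row membership via broadcasting.
    # Compares every row of `a` against every row of `b` at once;
    # memory grows as len(a) * len(b) * ncols.
    a = np.array([[100, 1, 0], [100, 2, 2], [101, 3, 0]])
    b = np.array([[100, 1, 0], [100, 2, 1]])
    mask = (a[:, None, :] == b[None, :, :]).all(axis=-1).any(axis=-1)
    print(mask)  # [ True False False]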

There is a trick, however, which can be used to view multiple columns of a 2D array as a single column of byte values -- thus turning a 2D array into a 1D array. We can then apply np.in1d to this 1D array. The trick, which I learned from Jaime, is encapsulated here in the function asvoid:

    import numpy as np

    def asvoid(arr):
        """
        View the array as dtype np.void (bytes).

        This views the last axis of ND-arrays as np.void (bytes) so
        comparisons can be performed on the entire row.
        https://stackoverflow.com/a/16840350/190597 (Jaime, 2013-05)

        Some caveats:
        - `asvoid` will work for integer dtypes, but be careful if using
          asvoid on float dtypes, since float zeros may compare UNEQUALLY:
          >>> asvoid([-0.]) == asvoid([0.])
          array([False], dtype=bool)
        - `asvoid` works best on contiguous arrays. If the input is not
          contiguous, `asvoid` will copy the array to make it contiguous,
          which will slow down the performance.
        """
        arr = np.ascontiguousarray(arr)
        return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

    A = np.array([[ 100, 1., 0., 1.5 ],
                  [ 100, 1., 1., 3.7 ],
                  [ 100, 2., 0., 1.2 ],
                  [ 100, 2., 1., 1.8 ],
                  [ 100, 2., 2., 2.9 ],
                  [ 101, 3., 0., 1.5 ],
                  [ 101, 3., 1., 3.7 ],
                  [ 101, 4., 0., 1.2 ],
                  [ 101, 4., 1., 1.8 ],
                  [ 101, 4., 2., 2.9 ]])

    B = np.array([[ 100, 1., 0., 1.25],
                  [ 100, 1., 1., 3.37],
                  [ 100, 2., 0., 1.42],
                  [ 100, 2., 1., 1.68]])

    # Mark the rows of A whose first three columns also appear in B
    mask = np.in1d(asvoid(A[:, :3]), asvoid(B[:, :3]))

    # Take the unmatched rows, zero out their feature column,
    # and stack them under B
    result = A[~mask]
    result[:, -1] = 0
    result = np.row_stack([B, result])
    print(result)

yields

    [[ 100.     1.     0.     1.25]
     [ 100.     1.     1.     3.37]
     [ 100.     2.     0.     1.42]
     [ 100.     2.     1.     1.68]
     [ 100.     2.     2.     0.  ]
     [ 101.     3.     0.     0.  ]
     [ 101.     3.     1.     0.  ]
     [ 101.     4.     0.     0.  ]
     [ 101.     4.     1.     0.  ]
     [ 101.     4.     2.     0.  ]]


You can try something like:

    import numpy as np

    A = np.array([[ 100, 1., 0., 1.5 ], [ 100, 1., 1., 3.7 ],
                  [ 100, 2., 0., 1.2 ], [ 100, 2., 1., 1.8 ],
                  [ 100, 2., 2., 2.9 ], [ 101, 3., 0., 1.5 ],
                  [ 101, 3., 1., 3.7 ], [ 101, 4., 0., 1.2 ],
                  [ 101, 4., 1., 1.8 ], [ 101, 4., 2., 2.9 ]])
    B = np.array([[ 100, 1., 0., 1.25], [ 100, 1., 1., 3.37],
                  [ 100, 2., 0., 1.42], [ 100, 2., 1., 1.68]])

    # Compare only the first three columns (ID1, ID2, Time_slice)
    listB = B[:, :3].tolist()
    for rowA in A:
        if rowA[:3].tolist() not in listB:
            B = np.append(B, [[rowA[0], rowA[1], rowA[2], 0]], axis=0)
    print(B)

3 Comments

It would be better to append or insert the new rows into a list (listB or a new copy) and rebuild the array at the end; repeated use of np.append is slow. (See the sketch after these comments.)
I made some changes to your code to compare the first three columns, and I used the same method @hpaulj suggested to optimize the speed. Thanks.
@hpaulj, yes, repeated use of np.append is slow. Working on the list and finalizing it as an array at the end is better for performance. I wasn't thinking about performance when coming up with a quick answer. Thanks.
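For concreteness, here is a minimal sketch of the list-based variant suggested in the comments, assuming the same A and B as in the question (the `seen` set of first-three-column keys is an added detail for fast lookups, not part of the original answer):

    import numpy as np

    A = np.array([[ 100, 1., 0., 1.5 ], [ 100, 1., 1., 3.7 ],
                  [ 100, 2., 0., 1.2 ], [ 100, 2., 1., 1.8 ],
                  [ 100, 2., 2., 2.9 ], [ 101, 3., 0., 1.5 ],
                  [ 101, 3., 1., 3.7 ], [ 101, 4., 0., 1.2 ],
                  [ 101, 4., 1., 1.8 ], [ 101, 4., 2., 2.9 ]])
    B = np.array([[ 100, 1., 0., 1.25], [ 100, 1., 1., 3.37],
                  [ 100, 2., 0., 1.42], [ 100, 2., 1., 1.68]])

    # Accumulate rows in a plain Python list and build the array once
    # at the end, instead of calling np.append inside the loop.
    seen = {tuple(row[:3]) for row in B}   # keys already present in B
    rows = B.tolist()
    for rowA in A:
        if tuple(rowA[:3]) not in seen:
            rows.append([rowA[0], rowA[1], rowA[2], 0.0])
    result = np.array(rows)
    print(result)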
