Example Problem
As a simple example, consider the numpy array arr as defined below:
    import numpy as np

    arr = np.array([[5, np.nan, np.nan, 7, 2],
                    [3, np.nan, 1, 8, np.nan],
                    [4, 9, 6, np.nan, np.nan]])

where arr looks like this in console output:

    array([[ 5., nan, nan,  7.,  2.],
           [ 3., nan,  1.,  8., nan],
           [ 4.,  9.,  6., nan, nan]])

I would now like to row-wise 'forward-fill' the nan values in array arr. By that I mean replacing each nan value with the nearest valid value from the left. The desired result would look like this:

    array([[ 5.,  5.,  5.,  7.,  2.],
           [ 3.,  3.,  1.,  8.,  8.],
           [ 4.,  9.,  6.,  6.,  6.]])

Tried thus far
I've tried using for-loops:
    for row_idx in range(arr.shape[0]):
        # Start at column 1: column 0 has no left neighbour, and
        # col_idx - 1 would otherwise wrap around to the last column.
        for col_idx in range(1, arr.shape[1]):
            if np.isnan(arr[row_idx][col_idx]):
                arr[row_idx][col_idx] = arr[row_idx][col_idx - 1]

I've also tried using a pandas dataframe as an intermediate step (since pandas dataframes have a very neat built-in method for forward-filling):
    import pandas as pd

    df = pd.DataFrame(arr)
    # Originally written as df.fillna(method='ffill', axis=1, inplace=True)
    # followed by df.as_matrix(); newer pandas uses ffill() and to_numpy().
    df = df.ffill(axis=1)
    arr = df.to_numpy()

Both of the above strategies produce the desired result, but I keep wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?
Summary
Is there another more efficient way to 'forward-fill' nan values in numpy arrays? (e.g. by using numpy vectorized operations)
Update: Solutions Comparison
I've tried to time all solutions thus far. This was my setup script:
    import numba as nb
    import numpy as np
    import pandas as pd

    def random_array():
        # 1000 x 10 array with roughly 10% nan values
        choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
        out = np.random.choice(choices, size=(1000, 10))
        return out

    def loops_fill(arr):
        out = arr.copy()
        for row_idx in range(out.shape[0]):
            for col_idx in range(1, out.shape[1]):
                if np.isnan(out[row_idx, col_idx]):
                    out[row_idx, col_idx] = out[row_idx, col_idx - 1]
        return out

    @nb.jit
    def numba_loops_fill(arr):
        '''Numba decorator solution provided by shx2.'''
        out = arr.copy()
        for row_idx in range(out.shape[0]):
            for col_idx in range(1, out.shape[1]):
                if np.isnan(out[row_idx, col_idx]):
                    out[row_idx, col_idx] = out[row_idx, col_idx - 1]
        return out

    def pandas_fill(arr):
        df = pd.DataFrame(arr)
        # Timed originally with df.fillna(method='ffill', axis=1, inplace=True)
        # and df.as_matrix(); newer pandas uses ffill() and to_numpy().
        df = df.ffill(axis=1)
        out = df.to_numpy()
        return out

    def numpy_fill(arr):
        '''Solution provided by Divakar.'''
        mask = np.isnan(arr)
        # Column index where the value is valid, 0 where it is nan
        idx = np.where(~mask, np.arange(mask.shape[1]), 0)
        # Running maximum = index of the last valid column seen so far
        np.maximum.accumulate(idx, axis=1, out=idx)
        out = arr[np.arange(idx.shape[0])[:, None], idx]
        return out

followed by this console input:
    %timeit -n 1000 loops_fill(random_array())
    %timeit -n 1000 numba_loops_fill(random_array())
    %timeit -n 1000 pandas_fill(random_array())
    %timeit -n 1000 numpy_fill(random_array())

resulting in this console output:

    1000 loops, best of 3: 9.64 ms per loop
    1000 loops, best of 3: 377 µs per loop
    1000 loops, best of 3: 455 µs per loop
    1000 loops, best of 3: 351 µs per loop
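To make it easier to see why the numpy_fill approach works, here is a minimal walkthrough of its three steps on the small example array from the top of the question. This is only an illustrative sketch; the intermediate values in the comments are what I would expect for this particular input.

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

mask = np.isnan(arr)

# Step 1: keep the column index where the value is valid, 0 where it is nan.
idx = np.where(~mask, np.arange(mask.shape[1]), 0)
# [[0 0 0 3 4]
#  [0 0 2 3 0]
#  [0 1 2 0 0]]

# Step 2: a running maximum along each row turns this into
# "index of the last valid column seen so far".
np.maximum.accumulate(idx, axis=1, out=idx)
# [[0 0 0 3 4]
#  [0 0 2 3 3]
#  [0 1 2 2 2]]

# Step 3: fancy indexing picks, for every cell, the value at that column.
filled = arr[np.arange(idx.shape[0])[:, None], idx]
print(filled)
```

Because every step is a whole-array operation, no Python-level loop runs over the rows or columns.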
… nan?
… NaN untouched. I would assume the OP wants the same behavior for consistency.
… nan values. So it's okay for me when the code (upon encountering a nan in the first column) either raises an exception or leaves that nan in place.
… as_matrix(): the original arr is changed.
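Regarding a nan in the first column, the two acceptable behaviors (leave it in place, or raise an exception) could be sketched as below. The function name ffill_rows and its on_leading_nan parameter are hypothetical and only for illustration; the filling itself reuses the np.maximum.accumulate idea from the numpy solution.

```python
import numpy as np

def ffill_rows(arr, on_leading_nan="keep"):
    """Row-wise forward-fill (hypothetical helper, not part of numpy).

    on_leading_nan="keep"  leaves nan values that have no valid value
                           to their left untouched.
    on_leading_nan="raise" raises ValueError if the first column
                           contains a nan.
    """
    arr = np.asarray(arr, dtype=float)
    mask = np.isnan(arr)
    if on_leading_nan == "raise" and mask[:, 0].any():
        raise ValueError("nan in the first column cannot be forward-filled")
    # Index of the last valid column so far; stays 0 for leading nan
    # values, so those cells simply pick up the nan in column 0 again.
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(arr.shape[0])[:, None], idx]

demo = np.array([[np.nan, 1.0, np.nan],
                 [2.0, np.nan, np.nan]])

kept = ffill_rows(demo)  # the leading nan in row 0 stays in place
print(kept)
```

Calling ffill_rows(demo, on_leading_nan="raise") on the same input would raise a ValueError instead. Neither call modifies demo, since fancy indexing returns a new array.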