1

I need to replace NaN with values from the previous row except for the first row where NaN values are replaced with zero. What would be the most efficient solution?

Sample input, output -

    In [179]: arr
    Out[179]:
    array([[ 5., nan, nan,  7.,  2.,  6.,  5.],
           [ 3., nan,  1.,  8., nan,  5., nan],
           [ 4.,  9.,  6., nan, nan, nan,  7.]])

    In [180]: out
    Out[180]:
    array([[ 5.,  0,  0.,  7.,  2.,  6.,  5.],
           [ 3.,  0,  1.,  8.,  2.,  5.,  5.],
           [ 4.,  9.,  6.,  8.,  2.,  6.,  7.]])
    The output does not match your description. Where there is a nan above another nan you actually do not want the value from the row above, but the first non-nan value above it - columnwise. So, which one is it? Commented May 27, 2020 at 8:37
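The two readings raised in this comment can be contrasted with a small sketch. The names `one_step` and `full` are hypothetical labels for this illustration only; the array is the one from the question.

```python
import numpy as np

nan = np.nan
arr = np.array([[5., nan, nan, 7., 2., 6., 5.],
                [3., nan, 1., 8., nan, 5., nan],
                [4., 9., 6., nan, nan, nan, 7.]])

# Reading 1: zero-fill the first row, then copy each remaining NaN from the
# row directly above *as it was in the input* (one step only).
one_step = arr.copy()
one_step[0, np.isnan(one_step[0])] = 0
prev = one_step[:-1].copy()            # snapshot of rows 0..n-2
mask = np.isnan(one_step[1:])
one_step[1:][mask] = prev[mask]        # writes through the view into one_step

# Reading 2: take the nearest non-NaN value above, column-wise
# (rows are filled in order, so the row above is already complete).
full = arr.copy()
full[0, np.isnan(full[0])] = 0
for i in range(1, len(full)):
    m = np.isnan(full[i])
    full[i, m] = full[i - 1, m]

print(np.isnan(one_step).sum())  # one NaN is left, at position [2, 4]
print(np.isnan(full).sum())      # no NaN is left
```

Under the first reading a NaN survives wherever two NaNs are stacked in a column, which is why the sample output only matches the second reading.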

5 Answers

4

(EDIT to include a (partially?) vectorized approach)

(EDIT2 to include some timings)

The simplest solution matching your required input/output is by looping through the rows:

    import numpy as np


    def ffill_loop(arr, fill=0):
        mask = np.isnan(arr[0])
        arr[0][mask] = fill
        for i in range(1, len(arr)):
            mask = np.isnan(arr[i])
            arr[i][mask] = arr[i - 1][mask]
        return arr


    print(ffill_loop(arr.copy()))
    # [[5. 0. 0. 7. 2. 6. 5.]
    #  [3. 0. 1. 8. 2. 5. 5.]
    #  [4. 9. 6. 8. 2. 5. 7.]]

You could also use a vectorized approach, which may be faster for larger inputs (the fewer NaN values stacked directly below one another, the better):

    import numpy as np


    def ffill_roll(arr, fill=0, axis=0):
        mask = np.isnan(arr)
        replaces = np.roll(arr, 1, axis)
        slicing = tuple(0 if i == axis else slice(None) for i in range(arr.ndim))
        replaces[slicing] = fill
        while np.count_nonzero(mask) > 0:
            arr[mask] = replaces[mask]
            mask = np.isnan(arr)
            replaces = np.roll(replaces, 1, axis)
        return arr


    print(ffill_roll(arr.copy()))
    # [[5. 0. 0. 7. 2. 6. 5.]
    #  [3. 0. 1. 8. 2. 5. 5.]
    #  [4. 9. 6. 8. 2. 5. 7.]]

Timing these functions (including the loop-less solution proposed in @Divakar's answer) gives:

    import numpy as np
    from numpy import nan

    # Note: `%timeit` is an IPython magic, so this must run in IPython/Jupyter.
    funcs = ffill_loop, ffill_roll, ffill_cols
    sep = ' ' * 4
    print(f'{"shape":15s}', end=sep)
    for func in funcs:
        print(f'{func.__name__:>15s}', end=sep)
    print()
    for n in (1, 5, 10, 50, 100, 500, 1000, 2000):
        k = l = n
        # Tile the sample array: k times along the columns, l times along the rows.
        arr = np.array([[ 5., nan, nan, 7., 2., 6., 5.] * k,
                        [ 3., nan, 1., 8., nan, 5., nan] * k,
                        [ 4., 9., 6., nan, nan, nan, 7.] * k] * l)
        print(f'{arr.shape!s:15s}', end=sep)
        for func in funcs:
            result = %timeit -q -o func(arr.copy())
            print(f'{result.best * 1e3:12.3f} ms', end=sep)
        print()
    shape                   ffill_loop         ffill_roll         ffill_cols
    (3, 7)                    0.009 ms           0.063 ms           0.026 ms
    (15, 35)                  0.043 ms           0.074 ms           0.034 ms
    (30, 70)                  0.092 ms           0.098 ms           0.055 ms
    (150, 350)                0.783 ms           0.939 ms           0.786 ms
    (300, 700)                2.409 ms           4.060 ms           3.829 ms
    (1500, 3500)             49.447 ms         105.379 ms         169.649 ms
    (3000, 7000)            169.799 ms         340.548 ms         759.854 ms
    (6000, 14000)           656.982 ms        1369.651 ms        1610.094 ms

This indicates that ffill_loop() is the fastest for most of the given inputs, while ffill_cols() progressively becomes the slowest approach as the input size increases.
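Before comparing speed, it may be worth confirming that the two functions above agree on less regular data. The following is a minimal sketch of such a check; the random test array, its shape, the seed, and the 40% NaN density are arbitrary assumptions, not part of the original timings.

```python
import numpy as np


def ffill_loop(arr, fill=0):
    mask = np.isnan(arr[0])
    arr[0][mask] = fill
    for i in range(1, len(arr)):
        mask = np.isnan(arr[i])
        arr[i][mask] = arr[i - 1][mask]
    return arr


def ffill_roll(arr, fill=0, axis=0):
    mask = np.isnan(arr)
    replaces = np.roll(arr, 1, axis)
    slicing = tuple(0 if i == axis else slice(None) for i in range(arr.ndim))
    replaces[slicing] = fill
    while np.count_nonzero(mask) > 0:
        arr[mask] = replaces[mask]
        mask = np.isnan(arr)
        replaces = np.roll(replaces, 1, axis)
    return arr


# Hypothetical test data: random values with roughly 40% NaN entries.
rng = np.random.default_rng(0)
data = rng.random((50, 20))
data[rng.random(data.shape) < 0.4] = np.nan

# Both functions modify their argument in place, so pass copies.
a = ffill_loop(data.copy())
b = ffill_roll(data.copy())
same = np.array_equal(a, b)
print(same)
```

Note that `ffill_roll()` only terminates if `fill` itself is not NaN, since the fill value is what finally replaces a run of NaNs that reaches the first row.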


4 Comments

@Divakar Yes it is. Vectorization and looping are not mutually exclusive. Sometimes it is beneficial to vectorize only some part of the algorithm. While the first approach requires looping proportional to the input size, the second approach only loops if there are np.nan values below each other, so it is not dependent on the input size.
My contention is just regarding terminology. If you think most of the computation is outside of that while-loop or with the first iteration, you could call it partly or partial vectorization. At least that's how I go with terminology.
@Divakar I would consider a portion of code vectorized if it does not include looping over array dimensions, and would call it partially vectorized if it loops only along some dimensions but not others. I do not know which of the two nomenclatures for partial vectorization is more widely used.
3

Here's a vectorized NumPy-based approach, inspired by an answer to the question "Most efficient way to forward-fill NaN values in numpy array":

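    import numpy as np


    def ffill_cols(a, startfillval=0):
        mask = np.isnan(a)
        tmp = a[0].copy()                 # remember the original first row
        a[0][mask[0]] = startfillval      # temporarily fill NaNs in the first row
        mask[0] = False
        # Row index of each valid entry, 0 where the entry is NaN
        idx = np.where(~mask, np.arange(mask.shape[0])[:, None], 0)
        # The running maximum gives the row of the last valid entry above
        out = np.take_along_axis(a, np.maximum.accumulate(idx, axis=0), axis=0)
        a[0] = tmp                        # restore the input's first row
        return out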

Sample run -

    In [2]: a
    Out[2]:
    array([[ 5., nan, nan,  7.,  2.,  6.,  5.],
           [ 3., nan,  1.,  8., nan,  5., nan],
           [ 4.,  9.,  6., nan, nan, nan,  7.]])

    In [3]: ffill_cols(a)
    Out[3]:
    array([[5., 0., 0., 7., 2., 6., 5.],
           [3., 0., 1., 8., 2., 5., 5.],
           [4., 9., 6., 8., 2., 5., 7.]])
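To see why the index trick works, here is a sketch that exposes the intermediate arrays for the sample input. The names `filled`, `rows`, and `src` are introduced only for this illustration and do not appear in the function above.

```python
import numpy as np

nan = np.nan
a = np.array([[5., nan, nan, 7., 2., 6., 5.],
              [3., nan, 1., 8., nan, 5., nan],
              [4., 9., 6., nan, nan, nan, 7.]])

filled = a.copy()
filled[0, np.isnan(filled[0])] = 0      # first row: NaN -> 0
mask = np.isnan(filled)                 # remaining NaNs (rows 1 and 2 only)

rows = np.arange(filled.shape[0])[:, None]
idx = np.where(~mask, rows, 0)          # own row index if valid, else 0
src = np.maximum.accumulate(idx, axis=0)  # row of the last valid value so far
out = np.take_along_axis(filled, src, axis=0)

print(src)
# [[0 0 0 0 0 0 0]
#  [1 0 1 1 0 1 0]
#  [2 2 2 1 0 1 2]]
print(out)
# [[5. 0. 0. 7. 2. 6. 5.]
#  [3. 0. 1. 8. 2. 5. 5.]
#  [4. 9. 6. 8. 2. 5. 7.]]
```

Each entry of `src` names the row to read from in the same column, so a NaN entry simply points at the nearest valid row above it.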


1
    import numpy as np

    arr = np.array([[ 5., np.nan, np.nan, 7., 2., 6., 5.],
                    [ 3., np.nan, 1., 8., np.nan, 5., np.nan],
                    [ 4., 9., 6., np.nan, np.nan, np.nan, 7.]])

    nan_indices = np.isnan(arr)

Where nan_indices gives you:

    array([[False,  True,  True, False, False, False, False],
           [False,  True, False, False,  True, False,  True],
           [False, False, False,  True,  True,  True, False]])

Now it's just a matter of replacing the values using the logic you mentioned in the question:

    arr[0, nan_indices[0, :]] = 0
    for row in range(1, np.shape(arr)[0]):
        arr[row, nan_indices[row, :]] = arr[row - 1, nan_indices[row, :]]

Now arr is:

    array([[5., 0., 0., 7., 2., 6., 5.],
           [3., 0., 1., 8., 2., 5., 5.],
           [4., 9., 6., 8., 2., 5., 7.]])


0
    from numpy import *

    a = array([[5., nan, nan, 7., 2., 6., 5.],
               [3., nan, 1., 8., nan, 5., nan],
               [4., 9., 6., nan, nan, nan, 7.]])

Replace NaN with zeros in the first row:

    where_are_NaNs = isnan(a[0])
    a[0][where_are_NaNs] = 0

Replace NaN in the other rows:

    where_are_NaNs = isnan(a)
    for i in range(len(where_are_NaNs)):
        for j in range(len(where_are_NaNs[0])):
            if where_are_NaNs[i][j]:
                a[i][j] = a[i-1][j]


0

How about this?

    import numpy as np

    x = np.array([[ 5., np.nan, np.nan, 7., 2., 6., 5.],
                  [ 3., np.nan, 1., 8., np.nan, 5., np.nan],
                  [ 4., 9., 6., np.nan, np.nan, np.nan, 7.]])


    def fillnans(a):
        a[0, np.isnan(a[0,:])] = 0
        while np.any(np.isnan(a)):
            a[np.isnan(a)] = np.roll(a, 1, 0)[np.isnan(a)]
        return a


    print(x)
    print(fillnans(x))

Output

    [[ 5. nan nan  7.  2.  6.  5.]
     [ 3. nan  1.  8. nan  5. nan]
     [ 4.  9.  6. nan nan nan  7.]]
    [[5. 0. 0. 7. 2. 6. 5.]
     [3. 0. 1. 8. 2. 5. 5.]
     [4. 9. 6. 8. 2. 5. 7.]]

I hope this helps!
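One thing to keep in mind is that `fillnans()` writes into the array it is given and returns that same array. The sketch below illustrates this on a smaller, made-up input (not the array from the question); pass a copy if the original NaN pattern is still needed.

```python
import numpy as np


def fillnans(a):
    a[0, np.isnan(a[0, :])] = 0
    while np.any(np.isnan(a)):
        a[np.isnan(a)] = np.roll(a, 1, 0)[np.isnan(a)]
    return a


# Hypothetical input with NaNs stacked below one another.
x = np.array([[5., np.nan, 2.],
              [np.nan, np.nan, np.nan],
              [4., np.nan, np.nan]])

result = fillnans(x.copy())            # x itself keeps its NaNs
n_nan_before = int(np.isnan(x).sum())
print(n_nan_before)                    # 6
print(result)
# [[5. 0. 2.]
#  [5. 0. 2.]
#  [4. 0. 2.]]

aliased = fillnans(x)                  # no copy: x is modified in place
print(aliased is x)                    # True
```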
