
I have the following time-series dataframe, and I would like to fill the missing values with the previous value. However, I only want to fill the missing values between the first_valid_index and the last_valid_index of each row, so the columns to fill will be different for each row. How can I do this?

So, given this dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, np.nan, 5],
                   [1, 3, np.nan, 4, np.nan],
                   [4, np.nan, 7, np.nan, np.nan]],
                  columns=[2007, 2008, 2009, 2010, 2011])

Input dataframe:

   2007  2008  2009  2010  2011
0     1     2     3   NaN     5
1     1     3   NaN     4   NaN
2     4   NaN     7   NaN   NaN

Output dataframe:

   2007  2008  2009  2010  2011
0     1     2     3     3     5
1     1     3     3     4   NaN
2     4     4     7   NaN   NaN

I thought of creating new columns for first_valid_index and last_valid_index and then using .apply(), but how can I fill different columns per row?

def fillMissing(x):
    first_valid = int(x["first_valid"])
    last_valid = int(x["last_valid"])
    missing = []
    for i in range(first_valid, last_valid + 1):
        missing.append(i)
    # What should I do here, since the following is not valid?
    # x[missing] = x[missing].fillna(method='ffill', axis=1)

df.apply(fillMissing, axis=1)
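For what it's worth, the row-wise idea can work without the helper columns by slicing each row between its own valid labels inside .apply(). This is a sketch, not an official answer; fill_between_valid is my own name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, np.nan, 5],
                   [1, 3, np.nan, 4, np.nan],
                   [4, np.nan, 7, np.nan, np.nan]],
                  columns=[2007, 2008, 2009, 2010, 2011])

def fill_between_valid(row):
    # forward-fill only between this row's first and last non-NaN labels;
    # trailing (and leading) NaNs outside that window stay untouched
    first, last = row.first_valid_index(), row.last_valid_index()
    if first is not None:
        row.loc[first:last] = row.loc[first:last].ffill()
    return row

out = df.apply(fill_between_valid, axis=1)
```

Label slicing with .loc is inclusive on both ends, which is exactly the first_valid_index/last_valid_index window the question asks for; the downside is that a Python-level apply is much slower than the vectorized answers below on large frames.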
  • I don't see different columns for each row in your example. Commented Dec 26, 2017 at 6:25
  • @GarbageCollector Thanks, I edited the example. Notice how I want to fillna between 2007 and 2011 in the first row, but only between 2007 and 2009 in the third row. Commented Dec 26, 2017 at 6:29

2 Answers


You can do this with iloc, but I prefer to do it with NumPy. Essentially, use ffill to forward-fill values, then mask the cells that are NaN all the way to the end of the row.

v = df.values
mask = np.logical_and.accumulate(
    np.isnan(v)[:, ::-1], axis=1)[:, ::-1]
df.ffill(axis=1).mask(mask)

   2007  2008  2009  2010  2011
0   1.0   2.0   3.0   3.0   5.0
1   1.0   3.0   3.0   4.0   NaN
2   4.0   4.0   7.0   NaN   NaN
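A pandas-only variant of the same idea (my own sketch, not part of this answer): any cell that is still NaN after a backward fill along the row must lie after the row's last valid value, so bfill can build the trailing-NaN mask without dropping to NumPy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, np.nan, 5],
                   [1, 3, np.nan, 4, np.nan],
                   [4, np.nan, 7, np.nan, np.nan]],
                  columns=[2007, 2008, 2009, 2010, 2011])

# ffill fills everything forward; where() then re-blanks the cells
# that even a backward fill could not reach, i.e. trailing NaNs
out = df.ffill(axis=1).where(df.bfill(axis=1).notna())
```

Leading NaNs are also preserved for free, since a forward fill has nothing to copy into them in the first place.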

Here are two completely NumPy-based approaches, inspired by this post -

def app1(df):
    # Same as in the linked post
    arr = df.values
    m, n = arr.shape
    r = np.arange(n)
    mask = np.isnan(arr)
    idx = np.where(~mask, r, 0)
    idx = np.maximum.accumulate(idx, axis=1)
    out = arr[np.arange(m)[:, None], idx]

    # Additional part to keep the trailing NaN islands and output a dataframe
    out[(n - mask[:, ::-1].argmin(1))[:, None] <= r] = np.nan
    return pd.DataFrame(out, columns=df.columns)

def app2(df):
    arr = df.values
    m, n = arr.shape
    r = np.arange(m)
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(n), 0)

    put_idx = n - mask[:, ::-1].argmin(1)
    v = put_idx < n
    rv = r[v]
    idx[rv, put_idx[v]] = idx[rv, (put_idx - 1)[v]] + 1
    idx = np.maximum.accumulate(idx, axis=1)
    out = arr[r[:, None], idx]
    return pd.DataFrame(out, columns=df.columns)

Sample runs -

In [246]: df
Out[246]: 
   2007  2008  2009  2010  2011
0     1   2.0   3.0   NaN   5.0
1     1   3.0   NaN   4.0   NaN
2     4   NaN   7.0   NaN   NaN

In [247]: app1(df)
Out[247]: 
   2007  2008  2009  2010  2011
0   1.0   2.0   3.0   3.0   5.0
1   1.0   3.0   3.0   4.0   NaN
2   4.0   4.0   7.0   NaN   NaN

In [248]: app2(df)
Out[248]: 
   2007  2008  2009  2010  2011
0   1.0   2.0   3.0   3.0   5.0
1   1.0   3.0   3.0   4.0   NaN
2   4.0   4.0   7.0   NaN   NaN

Runtime test on a bigger df with 50% NaNs filled in -

In [249]: df = pd.DataFrame(np.random.randint(1,9,(5000,5000)).astype(float))

In [250]: idx = np.random.choice(df.size, df.size//2, replace=0)

In [251]: df.values.ravel()[idx] = np.nan

# @piRSquared's soln
In [252]: %%timeit
     ...: v = df.values
     ...: mask = np.logical_and.accumulate(
     ...:     np.isnan(v)[:, ::-1], axis=1)[:, ::-1]
     ...: df.ffill(axis=1).mask(mask)
1 loop, best of 3: 473 ms per loop

In [253]: %timeit app1(df)
1 loop, best of 3: 353 ms per loop

In [254]: %timeit app2(df)
1 loop, best of 3: 330 ms per loop
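As a sanity check on the vectorized approaches (my own addition, not from the answer), the ffill-plus-mask solution can be compared against a straightforward row-wise fill on random data; row_fill and the seeded generator here are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# random frame with roughly half the cells set to NaN
rng = np.random.default_rng(0)
arr = rng.integers(1, 9, (200, 50)).astype(float)
arr.ravel()[rng.choice(arr.size, arr.size // 2, replace=False)] = np.nan
df = pd.DataFrame(arr)

# vectorized: forward-fill, then re-mask the trailing all-NaN runs
v = df.values
mask = np.logical_and.accumulate(np.isnan(v)[:, ::-1], axis=1)[:, ::-1]
fast = df.ffill(axis=1).mask(mask)

# reference: fill each row only between its first and last valid labels
def row_fill(row):
    first, last = row.first_valid_index(), row.last_valid_index()
    if first is not None:
        row.loc[first:last] = row.loc[first:last].ffill()
    return row

slow = df.apply(row_fill, axis=1)
assert fast.equals(slow)
```

DataFrame.equals treats NaNs in the same positions as equal, so the assertion confirms both the filled values and the untouched trailing NaNs agree.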
