
I have a sparse dataframe with integer values. For example, create df as:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.nan, index=range(10), columns=['A', 'B', 'C'])
    df.loc[(0,'A')] = 6
    df.loc[(3,'A')] = 8
    df.loc[(4,'B')] = 2

and it looks like this

         A    B    C
    0    6  NaN  NaN
    1  NaN  NaN  NaN
    2  NaN  NaN  NaN
    3    8  NaN  NaN
    4  NaN    2  NaN
    5  NaN  NaN  NaN
    6  NaN  NaN  NaN
    7  NaN  NaN  NaN
    8  NaN  NaN  NaN
    9  NaN  NaN  NaN

Now I want to recursively fill each NaN with the value from the previous row minus 1 (provided that previous value is not NaN). For example, this code does the trick:

    for j in range(len(df.index)):
        df = df.fillna(value=df.shift(1)-1, limit=1)

and it produces

         A    B    C
    0    6  NaN  NaN
    1    5  NaN  NaN
    2    4  NaN  NaN
    3    8  NaN  NaN
    4    7    2  NaN
    5    6    1  NaN
    6    5    0  NaN
    7    4   -1  NaN
    8    3   -2  NaN
    9    2   -3  NaN

The problem is that this code is slow as hell on a "real" dataframe, even when I put a bound on the range of j. Since the operation looks very close to a simple df.fillna(method='ffill'), which is way faster, I was wondering whether there is a way to speed it up.

Thanks in advance for any answer, insight or comment.

2 Answers


This is not a general solution but should produce the expected output in your particular case:

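    for col in df.columns:
        # group id that increments at every non-null value
        g = df[col].notnull().cumsum()
        # .ffill() is equivalent to fillna(method='ffill'), which newer pandas deprecates
        df[col] = df[col].ffill() - df[col].groupby(g).cumcount()

Basically, you fill forward and then subtract the number of consecutive NaNs since the last non-null value.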
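To make the individual steps visible, here is a minimal sketch that applies the same idea to column A of the toy dataframe from the question and keeps each intermediate result. The names `filled`, `offset`, and `steps` are only illustrative and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Toy dataframe from the question
df = pd.DataFrame(np.nan, index=range(10), columns=['A', 'B', 'C'])
df.loc[0, 'A'] = 6
df.loc[3, 'A'] = 8
df.loc[4, 'B'] = 2

col = df['A']
g = col.notnull().cumsum()          # group id: increases by 1 at each non-null value
filled = col.ffill()                # carry the last non-null value forward
offset = col.groupby(g).cumcount()  # 0 at the non-null value, then 1, 2, ... for the NaNs after it

# Side-by-side view of every step (illustrative helper frame)
steps = pd.DataFrame({'A': col, 'g': g, 'filled': filled,
                      'offset': offset, 'result': filled - offset})
print(steps)
```

Rows before the first non-null value of a column (for example rows 0 to 3 of column B) stay NaN, because the forward fill has nothing to carry there and NaN minus a count is still NaN.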


2 Comments

Thanks gereleth! It works, and it is indeed much faster. Do you know whether the difference in performance comes from not passing a dataframe to the 'value' argument of fillna at each step?
I think your code does two whole-dataframe operations for every row - first shift and then fillna. That's a lot more number-crunching than necessary =).

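My comparisons on your toy problem suggest the code below is quicker than both yours and the accepted answer; your mileage may vary on your actual problem.

    # .items() replaces .iteritems(), which was removed in pandas 2.0
    for col, series in df.items():
        series = series.copy()        # work on a copy, then write the column back
        reference = series.iloc[0]    # last value seen so far (may be NaN)
        for idx, val in series.items():
            if np.isnan(val):
                reference = reference - 1
                series.at[idx] = reference
            else:
                reference = val
        df[col] = series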
