why does pandas rolling use single dimension ndarray

Question

I was motivated to use the pandas rolling feature to perform a rolling multi-factor regression (this question is NOT about rolling multi-factor regression). I expected that I'd be able to use apply after a df.rolling(2), take the resulting pd.DataFrame, extract the ndarray with .values, and perform the requisite matrix multiplication. It didn't work out that way.

Here is what I found:

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
X = np.random.rand(2, 1).round(2)

What do objects look like:

print "\ndf = \n", df
print "\nX = \n", X
print "\ndf.shape =", df.shape, ", X.shape =", X.shape

df = 
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

X = 
[[ 0.93]
 [ 0.83]]

df.shape = (5, 2) , X.shape = (2L, 1L)

Matrix multiplication behaves normally:

df.values.dot(X)

array([[ 0.7495],
       [ 0.8179],
       [ 0.4444],
       [ 1.4711],
       [ 1.3562]])

Using apply to perform row by row dot product behaves as expected:

df.apply(lambda x: x.values.dot(X)[0], axis=1)

0    0.7495
1    0.8179
2    0.4444
3    1.4711
4    1.3562
dtype: float64

Groupby -> Apply behaves as I'd expect:

df.groupby(level=0).apply(lambda x: x.values.dot(X)[0, 0])

0    0.7495
1    0.8179
2    0.4444
3    1.4711
4    1.3562
dtype: float64

But when I run:

df.rolling(1).apply(lambda x: x.values.dot(X))

I get:

AttributeError: 'numpy.ndarray' object has no attribute 'values'

Ok, so pandas is using straight ndarray within its rolling implementation. I can handle that. Instead of using .values to get the ndarray, let's try:

df.rolling(1).apply(lambda x: x.dot(X))

shapes (1,) and (2,1) not aligned: 1 (dim 0) != 2 (dim 0)

Wait! What?!

So I created a custom function to look at what rolling is doing.

def print_type_sum(x):
    print type(x), x.shape
    return x.sum()

Then ran:

print df.rolling(1).apply(print_type_sum)

<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

My resulting pd.DataFrame is the same, that's good. But it printed out 10 single-dimensional ndarray objects. What about rolling(2)?

print df.rolling(2).apply(print_type_sum)

<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
      A     B
0   NaN   NaN
1  0.90  0.88
2  0.92  0.49
3  1.31  0.84
4  1.63  1.58

Same thing: the output is what I expected, but it printed 8 ndarray objects. rolling is producing a single-dimensional ndarray of length window for each column, as opposed to what I expected, which was an ndarray of shape (window, len(df.columns)).
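
To make the two shapes concrete, here is a minimal sketch. It assumes pandas 0.23 or later for the raw keyword, and record_shape and windows_2d are names made up for this illustration. It records what rolling(2).apply hands to the function and contrasts that with the 2-D blocks obtained by slicing the underlying array by hand:

```python
import numpy as np
import pandas as pd

# Same values as the df in the question, written out explicitly.
df = pd.DataFrame({'A': [0.44, 0.46, 0.46, 0.85, 0.78],
                   'B': [0.41, 0.47, 0.02, 0.82, 0.76]})
w = 2

seen_shapes = []


def record_shape(x):
    # Hypothetical helper: remember the shape of each chunk rolling passes in.
    seen_shapes.append(x.shape)
    return x.sum()


rolled_sum = df.rolling(w).apply(record_shape, raw=True)

# Every chunk is 1-D with length w, because rolling walks each column separately.
print(set(seen_shapes))

# Slicing the values by hand gives the (w, n_columns) blocks the question expected.
windows_2d = [df.values[i:i + w] for i in range(len(df) - w + 1)]
print(windows_2d[0].shape)
```

The hand-built list is only meant to show the shape that was expected; it is not something rolling itself produces.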

Question is Why?

I now don't have a way to easily run a rolling multi-factor regression.

This is a known issue. I recently asked Jeff about it; you can read his answer in the comments!
– IanS, Commented May 27, 2016 at 15:12
What is the state-of-the-art solution as of pandas 0.20? It seems lots of improvements have been made. Is the objective in the OP achievable using rolling().apply() directly?
– Zhang18, Commented Jun 15, 2017 at 14:45

Community · Accepted Answer · 2017-05-23 12:09:29Z

I wanted to share what I've done to work around this problem.

Given a pd.DataFrame and a window, I generate a stacked ndarray using np.dstack (see answer). I then convert it to a pd.Panel and, using pd.Panel.to_frame, convert it to a pd.DataFrame. At this point, I have a pd.DataFrame that has an additional level on its index relative to the original pd.DataFrame, and the new level contains information about each rolled period. For example, if the roll window is 3, the new index level will contain [0, 1, 2], one item for each period. I can now groupby level=0 and return the groupby object. This gives me an object that I can manipulate much more intuitively.

Roll Function

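import pandas as pd
import numpy as np

# note: pd.Panel was deprecated in pandas 0.20 and removed in 0.25,
# so this function needs an older pandas
def roll(df, w):
    # stack the len(df) - w + 1 overlapping windows along a third axis, then
    # transpose so the axes are (window, column, position within window)
    roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
    # label each window by the index of its last row
    panel = pd.Panel(roll_array,
                     items=df.index[w-1:],
                     major_axis=df.columns,
                     minor_axis=pd.Index(range(w), name='roll'))
    # flatten to a DataFrame and group by window label
    return panel.to_frame().unstack().T.groupby(level=0)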

Demonstration

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])

print df

      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

Let's sum

rolled_df = roll(df, 2)

print rolled_df.sum()

major     A     B
1      0.90  0.88
2      0.92  0.49
3      1.31  0.84
4      1.63  1.58

To peek under the hood, we can look at the structure:

print rolled_df.apply(lambda x: x)

major      A     B
  roll
1 0     0.44  0.41
  1     0.46  0.47
2 0     0.46  0.47
  1     0.46  0.02
3 0     0.46  0.02
  1     0.85  0.82
4 0     0.85  0.82
  1     0.78  0.76

But what about the purpose for which I built this, rolling multi-factor regression? I'll settle for matrix multiplication for now.

X = np.array([2, 3])

print rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))

      0     1
1  2.11  2.33
2  2.33  0.98
3  0.98  4.16
4  4.16  3.84
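
Since pd.Panel is no longer available in current pandas, the same idea can be sketched without it. This is an assumption-laden variant, not the code from this answer: roll_without_panel and the level names 'window' and 'roll' are hypothetical, and it builds one sub-frame per window with pd.concat instead of going through a Panel.

```python
import numpy as np
import pandas as pd


def roll_without_panel(df, w):
    # Hypothetical Panel-free variant: one sub-frame per window, keyed by
    # the index label of the window's last row.
    windows = {
        df.index[i + w - 1]: df.iloc[i:i + w].reset_index(drop=True)
        for i in range(len(df) - w + 1)
    }
    stacked = pd.concat(windows)
    # Outer level identifies the window, inner level the position inside it.
    stacked.index = stacked.index.set_names(['window', 'roll'])
    return stacked.groupby(level=0)


# Same values as the df in the demonstration, written out explicitly.
df = pd.DataFrame({'A': [0.44, 0.46, 0.46, 0.85, 0.78],
                   'B': [0.41, 0.47, 0.02, 0.82, 0.76]})

rolled = roll_without_panel(df, 2)
window_sums = rolled.sum()

# Matrix multiplication per window, iterating over the groups directly.
X = np.array([2, 3])
products = {label: block.values.dot(X) for label, block in rolled}
print(window_sums)
print(products)
```

Building a sub-frame per window is simple but copies the data w times, so for large frames a strided NumPy view is likely to be cheaper.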

This was very helpful, thanks. I ran into a little trouble with nan values, but updating the last line of the roll function to use .to_frame(filter_observations=False) fixed my issue.
This is helpful. But is there a way to make the 'roll' column keep the original index? e.g. 0.46 0.47 is always associated with "1". Thank you.
This is cool, but why the heck isn't this a feature of Pandas?

Community · Accepted Answer · 2017-05-23 12:16:53Z

Using strided views on the DataFrame's underlying array, here's a vectorized approach:

get_sliding_window(df, 2).dot(X) # window size = 2
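
get_sliding_window is not defined anywhere in this thread. What follows is a guess at what such a helper could look like, assuming it returns a read-only strided view of shape (n_windows, w, n_columns); it is a sketch built on numpy.lib.stride_tricks.as_strided, not necessarily the answerer's exact code.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def get_sliding_window(df, w):
    # Assumed reconstruction: accepts a DataFrame or a 2-D array, since
    # np.asarray(df) yields the frame's values.
    a = np.asarray(df)
    m, n = a.shape
    if w < 1 or w > m:
        raise ValueError("window must be between 1 and the number of rows")
    s0, s1 = a.strides
    # Window i starts at row i, so stepping to the next window and stepping
    # to the next row inside a window both advance by one row stride (s0).
    return as_strided(a, shape=(m - w + 1, w, n), strides=(s0, s0, s1),
                      writeable=False)


# Same values as the df in the first answer's demonstration.
values = np.array([[0.44, 0.41],
                   [0.46, 0.47],
                   [0.46, 0.02],
                   [0.85, 0.82],
                   [0.78, 0.76]])
X = np.array([2, 3])

out = get_sliding_window(values, 2).dot(X)  # window size = 2
print(out)
```

The view shares memory with the input rather than copying it, which is why it is marked read-only here.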

Runtime test -

In [101]: df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])

In [102]: X = np.array([2, 3])

In [103]: rolled_df = roll(df, 2)

In [104]: %timeit rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
100 loops, best of 3: 5.51 ms per loop

In [105]: %timeit get_sliding_window(df, 2).dot(X)
10000 loops, best of 3: 43.7 µs per loop

Verify results -

In [106]: rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
Out[106]: 
      0     1
1  2.70  4.09
2  4.09  2.52
3  2.52  1.78
4  1.78  3.50

In [107]: get_sliding_window(df, 2).dot(X)
Out[107]: 
array([[ 2.7 ,  4.09],
       [ 4.09,  2.52],
       [ 2.52,  1.78],
       [ 1.78,  3.5 ]])

Huge improvement there, which I am hoping would stay noticeable on larger arrays!

I get an unresolved reference error when trying to use get_sliding_window.

Gustav Engström · Accepted Answer · 2017-07-24 17:14:14Z

I made the following modifications to the above answer, since I needed to return the entire rolling window as is done in pd.DataFrame.rolling():

def roll(df, w):
    roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
    # prepend w-1 placeholder windows so that every row of df gets an entry
    # (np.empty leaves these placeholder values uninitialized)
    roll_array_full_window = np.vstack((np.empty((w-1, len(df.columns), w)), roll_array))
    panel = pd.Panel(roll_array_full_window,
                     items=df.index,
                     major_axis=df.columns,
                     minor_axis=pd.Index(range(w), name='roll'))
    return panel.to_frame().unstack().T.groupby(level=0)

gosuto · Accepted Answer · 2018-08-27 05:30:51Z

Since pandas v0.23, Rolling.apply() can pass a Series instead of an ndarray to your function. Just set raw=False.

raw : bool, default None

False : passes each row or column as a Series to the function.

True or None : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance. The raw parameter is required and will show a FutureWarning if not passed. In the future raw will default to False.

New in version 0.23.0.

As noted, if you only need a single dimension, passing it raw is obviously more efficient. This is probably the answer to your question: Rolling.apply() was initially built to pass only an ndarray because that is the most efficient option.
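
One way raw=False might help with the multi-column case from the question, sketched under the assumption that the frame's index labels are unique: the Series passed to the function carries the window's index labels, so they can be used to look up the full 2-D block in the original frame. window_total is a hypothetical name, and the function still has to return a single number per window.

```python
import numpy as np
import pandas as pd

# Same values as the df in the question, written out explicitly.
df = pd.DataFrame({'A': [0.44, 0.46, 0.46, 0.85, 0.78],
                   'B': [0.41, 0.47, 0.02, 0.82, 0.76]})
X = np.array([2.0, 3.0])


def window_total(s):
    # With raw=False, s is a Series; its index labels identify the window rows.
    block = df.loc[s.index]            # 2-D window across all columns
    return block.values.dot(X).sum()   # must reduce to a scalar


# Roll over a single column only to obtain the window labels.
result = df['A'].rolling(2).apply(window_total, raw=False)
print(result)
```

The first entry is NaN because the first window holds fewer than two rows. This goes through a Python-level lookup for every window, so it trades speed for convenience.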
