I wanted to use pandas' rolling feature to perform a rolling multi-factor regression (this question is NOT about rolling multi-factor regression). I expected to be able to call apply after df.rolling(2), take the resulting pd.DataFrame, extract the ndarray with .values, and perform the requisite matrix multiplication. It didn't work out that way.
Here is what I found:
    import pandas as pd
    import numpy as np

    np.random.seed([3,1415])
    df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
    X = np.random.rand(2, 1).round(2)

What do the objects look like?
    print "\ndf = \n", df
    print "\nX = \n", X
    print "\ndf.shape =", df.shape, ", X.shape =", X.shape

    df = 
          A     B
    0  0.44  0.41
    1  0.46  0.47
    2  0.46  0.02
    3  0.85  0.82
    4  0.78  0.76

    X = 
    [[ 0.93]
     [ 0.83]]

    df.shape = (5, 2) , X.shape = (2L, 1L)

Matrix multiplication behaves normally:
    df.values.dot(X)

    array([[ 0.7495],
           [ 0.8179],
           [ 0.4444],
           [ 1.4711],
           [ 1.3562]])

Using apply to perform a row-by-row dot product behaves as expected:
    df.apply(lambda x: x.values.dot(X)[0], axis=1)

    0    0.7495
    1    0.8179
    2    0.4444
    3    1.4711
    4    1.3562
    dtype: float64

Groupby -> Apply behaves as I'd expect:
    df.groupby(level=0).apply(lambda x: x.values.dot(X)[0, 0])

    0    0.7495
    1    0.8179
    2    0.4444
    3    1.4711
    4    1.3562
    dtype: float64

But when I run:
    df.rolling(1).apply(lambda x: x.values.dot(X))

I get:

    AttributeError: 'numpy.ndarray' object has no attribute 'values'
Ok, so pandas is using straight ndarray within its rolling implementation. I can handle that. Instead of using .values to get the ndarray, let's try:
    df.rolling(1).apply(lambda x: x.dot(X))

    shapes (1,) and (2,1) not aligned: 1 (dim 0) != 2 (dim 0)
Wait! What?!
So I created a custom function to look at what rolling is doing.
    def print_type_sum(x):
        print type(x), x.shape
        return x.sum()

Then ran:
    print df.rolling(1).apply(print_type_sum)

    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
    <type 'numpy.ndarray'> (1L,)
          A     B
    0  0.44  0.41
    1  0.46  0.47
    2  0.46  0.02
    3  0.85  0.82
    4  0.78  0.76

My resulting pd.DataFrame is the same, which is good. But it printed out 10 one-dimensional ndarray objects. What about rolling(2)?
    print df.rolling(2).apply(print_type_sum)

    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
    <type 'numpy.ndarray'> (2L,)
          A     B
    0   NaN   NaN
    1  0.90  0.88
    2  0.92  0.49
    3  1.31  0.84
    4  1.63  1.58

Same thing: the output is what I expected, but it printed 8 ndarray objects. rolling is producing a one-dimensional ndarray of length window for each column, as opposed to what I expected, which was a single ndarray of shape (window, len(df.columns)) per window.
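To make the expectation concrete, here is a minimal sketch of the blocks I was hoping each call would receive, built by slicing df.values by hand. The helper name iter_windows is hypothetical (it is not part of pandas), and the frame is rebuilt from the printed values above rather than from the random seed.

```python
import numpy as np
import pandas as pd

# Same values as the df printed above, typed in by hand.
df = pd.DataFrame(
    [[0.44, 0.41], [0.46, 0.47], [0.46, 0.02], [0.85, 0.82], [0.78, 0.76]],
    columns=['A', 'B'])


def iter_windows(frame, window):
    """Hypothetical helper: yield (label of last row, 2-D block) per full window."""
    values = frame.values
    for end in range(window, len(frame) + 1):
        # Rows end-window .. end-1, all columns: shape (window, n_columns).
        yield frame.index[end - 1], values[end - window:end]


# Every block spans both columns, unlike what rolling(2).apply hands over.
shapes = [block.shape for _, block in iter_windows(df, 2)]
```

With window=2 on this 5-row frame there are four full windows, each covering both columns at once.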
The question is: why?
I now don't have a way to easily run a rolling multi-factor regression.
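The only fallback I can see is to build the two-dimensional windows outside of rolling. A rough sketch, assuming NumPy 1.20 or later for sliding_window_view; the per-window sum of the dot product is just a placeholder for the real regression step, and the frame and X are typed in from the printed values above:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Values copied from the printed df and X above.
df = pd.DataFrame(
    [[0.44, 0.41], [0.46, 0.47], [0.46, 0.02], [0.85, 0.82], [0.78, 0.76]],
    columns=['A', 'B'])
X = np.array([[0.93], [0.83]])
window = 2

# sliding_window_view appends the window axis last: (n_windows, n_cols, window).
views = sliding_window_view(df.values, window, axis=0)
# Reorder so each block looks like a slice of the frame: (n_windows, window, n_cols).
blocks = views.transpose(0, 2, 1)

# (window, n_cols) @ (n_cols, 1) for every window -> (n_windows, window, 1).
products = blocks @ X

# Placeholder reduction: one number per window, aligned to the window's last row.
totals = products.sum(axis=(1, 2))
result = pd.Series(totals, index=df.index[window - 1:]).reindex(df.index)
```

This works, but it bypasses rolling entirely, which is what I was hoping to avoid.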