I am trying to use a pandas.DataFrame.rolling.apply() rolling function on multiple columns. Python version is 3.7, pandas is 1.0.2.
import pandas as pd #function to calculate def masscenter(x): print(x); # for debug purposes return 0; #simple DF creation routine df = pd.DataFrame( [['02:59:47.000282', 87.60, 739], ['03:00:01.042391', 87.51, 10], ['03:00:01.630182', 87.51, 10], ['03:00:01.635150', 88.00, 792], ['03:00:01.914104', 88.00, 10]], columns=['stamp', 'price','nQty']) df['stamp'] = pd.to_datetime(df2['stamp'], format='%H:%M:%S.%f') df.set_index('stamp', inplace=True, drop=True) 'stamp' is monotonic and unique, 'price' is double and contains no NaNs, 'nQty' is integer and also contains no NaNs.
So, I need to calculate rolling 'center of mass', i.e. sum(price*nQty)/sum(nQty).
What I tried so far:
df.apply(masscenter, axis = 1) masscenter is be called 5 times with a single row and the output will be like
price 87.6 nQty 739.0 Name: 1900-01-01 02:59:47.000282, dtype: float64 It is desired input to a masscenter, because I can easily access price and nQty using x[0], x[1]. However, I stuck with rolling.apply() Reading the docs DataFrame.rolling() and rolling.apply() I supposed that using 'axis' in rolling() and 'raw' in apply one achieves similiar behaviour. A naive approach
rol = df.rolling(window=2) rol.apply(masscenter) prints row by row (increasing number of rows up to window size)
stamp 1900-01-01 02:59:47.000282 87.60 1900-01-01 03:00:01.042391 87.51 dtype: float64 then
stamp 1900-01-01 02:59:47.000282 739.0 1900-01-01 03:00:01.042391 10.0 dtype: float64 So, columns is passed to masscenter separately (expected).
Sadly, in the docs there is barely any info about 'axis'. However the next variant was, obviously
rol = df.rolling(window=2, axis = 1) rol.apply(masscenter) Never calls masscenter and raises ValueError in rol.apply(..)
> Length of passed values is 1, index implies 5 I admit that I'm not sure about 'axis' parameter and how it works due to lack of documentation. It is the first part of the question: What is going on here? How to use 'axis' properly? What it is designed for?
Of course, there were answers previously, namely:
How-to-apply-a-function-to-two-columns-of-pandas-dataframe
It works for the whole DataFrame, not Rolling.
How-to-invoke-pandas-rolling-apply-with-parameters-from-multiple-column
The answer suggests to write my own roll function, but the culprit for me is the same as asked in comments: what if one needs to use offset window size (e.g. '1T') for non-uniform timestamps?
I don't like the idea to reinvent the wheel from scratch. Also I'd like to use pandas for everything to prevent inconsistency between sets obtained from pandas and 'self-made roll'. There is another answer to that question, suggessting to populate dataframe separately and calculate whatever I need, but it will not work: the size of stored data will be enormous. The same idea presented here:
Apply-rolling-function-on-pandas-dataframe-with-multiple-arguments
Another Q & A posted here
Pandas-using-rolling-on-multiple-columns
It is good and the closest to my problem, but again, there is no possibility to use offset window sizes (window = '1T').
Some of the answers were asked before pandas 1.0 came out, and given that docs could be much better, I hope it is possible to roll over multiple columns simultaneously now.
The second part of the question is: Is there any possibility to roll over multiple columns simultaneously using pandas 1.0.x with offset window size?
columns=['stamp', 'price','nQty']?