Calculating the mean for each time of day with a rolling window with pandas

Question

I have a pandas dataframe that has a datetime index and four columns Phase 1, Phase 2, Phase 3 and Sum. The data was preprocessed and has a row every 15 minutes and is a few months long. The data is very cyclic and almost repeats every day but changes slowly over time. The goal is to produce the mean of the value at a certain time of day over the last week (or other timeframe) for all days. (for a machine learning task)

I've managed to calculate the mean for each time of day using this code: (This produces a 1-day long dataframe)

df.groupby(df.index.hour * 60 + df.index.minute).mean()

 Phase 1 Phase 2 Phase 3 Sum Time 0 10.105782 10.235237 9.990037 30.331055 15 10.106374 10.116440 9.991424 30.214238 30 10.106517 10.086310 10.003420 30.196246 45 10.128441 10.249100 10.032895 30.410436 ... 1410 10.112582 10.643766 9.971592 30.727939 1425 10.102739 10.372299 9.969986 30.445025

This mean of all days together isn't very good though since the data changes gradually. It would be better if I could calculate this type of mean, but only include data from the last week for each day.

What I've tried so far is this:

df .groupby(df.index.hour * 60 + df.index.minute) .rolling("7D", closed="left") .mean()

It produces the correct data, but the date information is missing (it needs to be preserved for future calculations) and the rows are in the wrong order.

 Phase 1 Phase 2 Phase 3 Sum Time 0 NaN NaN NaN NaN 0 10.064458 10.051470 10.177814 30.293742 0 10.043804 9.983143 10.062019 30.088965 0 10.020861 9.917236 10.000181 29.938278 ... 0 10.224965 10.507418 10.030670 30.763053 0 10.155706 10.396408 9.919538 30.471651 0 10.149112 10.352153 9.894257 30.395522 0 10.144540 10.349998 9.902504 30.397042 15 NaN NaN NaN NaN 15 10.061673 9.967295 10.143008 30.171976 15 10.059581 10.158814 10.051835 30.270230 15 9.995112 10.024808 9.999054 30.018974 ...

Also there's the issue of NaNs appearing when the first day is not fully present. Do incomplete days need to be removed first or can they be incorporated into the mean?

I've also tried this:

df .groupby([ pd.Grouper(freq="1D"), df.index.hour * 60 + df.index.minute ]) .rolling("7D", closed="left") .mean()

But it produces a dataframe conisting only of NaNs so something must be going very wrong.

The result is supposed to look something like this:

 Phase 1 Phase 2 Phase 3 Sum Time 2021-02-13 00:00:00 11.882597 12.779326 12.458625 37.120549 2021-02-13 00:15:00 11.866148 12.871785 12.509614 37.247547 2021-02-13 00:30:00 11.713676 12.730861 12.525868 36.970405 2021-02-13 00:45:00 11.742079 12.697406 12.592411 37.031897 2021-02-13 01:00:00 11.765234 12.848741 12.622687 37.236662 ... 2021-05-01 10:30:00 11.842673 12.190760 12.572203 36.605636 2021-05-01 10:45:00 11.837964 12.118095 12.611271 36.567331 2021-05-01 11:00:00 11.827275 12.220564 12.588131 36.635970

In this example, the second row contains the average values of 2021-02-13 00:15:00, 2021-02-12 00:15:00, ..., 2021-02-07 00:15:00. I'm not new to programming, but relatively new to python and pandas so any help and hints are very much appreciated.

perl · Accepted Answer · 2021-05-01 16:34:02Z

You can pre-filter the dataset to only include 13 days preceding the dt date, then groupby time, taking 7 days rolling with min_periods=7, take mean and dropna to remove dates that have accumulated values for fewer than 7 of the previous days:

# generate sample dataset ix = pd.date_range('2021-01-01', '2021-05-01', freq='15min') df = pd.DataFrame({ 'Phase1': np.random.uniform(0, 1, len(ix)), 'Phase2': np.random.uniform(0, 1, len(ix)), 'Phase3': np.random.uniform(0, 1, len(ix)), }, index=ix) df['Sum'] = df.sum(1) # set max date dt = pd.to_datetime('2021-02-14') # filter out values in [dt - 13 days, dt) z = df.loc[(df.index >= dt - pd.DateOffset(days=13)) & (df.index < dt)] # calculate 7-day rolling average for the same time of the day # for 7 days preceding `dt` (z .groupby(z.index.time) .rolling('7d', min_periods=7) .mean() .dropna() .droplevel(0) .sort_index())

Output:

 Phase1 Phase2 Phase3 Sum 2021-02-07 00:00:00 0.479466 0.731746 0.503017 1.714229 2021-02-07 00:15:00 0.443550 0.423135 0.543204 1.409889 2021-02-07 00:30:00 0.465272 0.626117 0.454462 1.545851 2021-02-07 00:45:00 0.528733 0.433475 0.386822 1.349029 2021-02-07 01:00:00 0.425309 0.360065 0.488509 1.273884 ... ... ... ... ... 2021-02-13 22:45:00 0.519717 0.490549 0.524330 1.534596 2021-02-13 23:00:00 0.367935 0.460093 0.373338 1.201366 2021-02-13 23:15:00 0.597424 0.438130 0.478259 1.513813 2021-02-13 23:30:00 0.675142 0.443580 0.330791 1.449514 2021-02-13 23:45:00 0.474604 0.355723 0.596467 1.426794

So how can I use this to calculate this for every day? Maybe my description was unclear but the mean of the last week should be calculated for all days, not just the last.
@douira Yes, sorry, then I misunderstood your question, let me update...
it's my fault the introduction was written a little weirdly, I think it's clearer now. Let me know if it's still confusing.
@douira OK, please see the updated version with the 7-day rolling average for the same time of the day for the last 7 days
it works well! I'm just using the last part of your code, starting with groupby since I want to process all days. Thanks for the help.

Collectives™ on Stack Overflow

Calculating the mean for each time of day with a rolling window with pandas

1 Answer 1

5 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Related