
The goal is to compute a rolling average over 5-minute windows using Polars, where:

  • the window ends at each timestamp t
  • the left edge of the window is not strict — if there are no values exactly at t - 5min, it should include an earlier point
  • input data contains columns: timestamp and value

In Pandas, I used to handle this kind of logic easily by overriding BaseIndexer along with some custom Numba-accelerated logic to compute rolling window boundaries manually. I'd love to have similar functionality in Polars, but I don't know how to achieve it yet.
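For reference, the pandas approach described above can be sketched with a custom BaseIndexer (the class name and the timestamps/window_ns attributes are illustrative, and this sketch uses plain NumPy rather than Numba):

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class NonStrictLeftIndexer(BaseIndexer):
    """Window [t - window, t] per row; if no point sits exactly on
    t - window, step the left edge back one extra row."""

    def get_window_bounds(self, num_values, min_periods, center, closed, step=None):
        ts = self.timestamps  # sorted int64 nanoseconds (stored via **kwargs)
        start = np.empty(num_values, dtype=np.int64)
        end = np.arange(1, num_values + 1, dtype=np.int64)  # window ends at row i
        for i in range(num_values):
            boundary = ts[i] - self.window_ns
            left = int(np.searchsorted(ts, boundary, side="left"))
            if left == len(ts) or ts[left] != boundary:
                left = max(left - 1, 0)  # non-strict edge: pull in one earlier point
            start[i] = left
        return start, end

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-04-10 10:00:01", "2024-04-10 10:01:30", "2024-04-10 10:03:10",
        "2024-04-10 10:05:00", "2024-04-10 10:06:00", "2024-04-10 10:08:10",
    ]),
    "value": [1, 2, 3, 4, 5, 6],
})
indexer = NonStrictLeftIndexer(
    timestamps=df["timestamp"].to_numpy().astype("int64"),
    window_ns=pd.Timedelta(minutes=5).value,
)
df["rolling_value"] = df["value"].rolling(indexer, min_periods=1).mean()
```

On the example data this produces the means 1.0, 1.5, 2.0, 2.5, 3.0, 4.5 shown in the expected output below.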

Here is an example:

```python
import polars as pl

df = pl.DataFrame({
    "timestamp": [
        "2024-04-10 10:00:01",
        "2024-04-10 10:01:30",
        "2024-04-10 10:03:10",
        "2024-04-10 10:05:00",
        "2024-04-10 10:06:00",
        "2024-04-10 10:08:10",
    ],
    "value": [1, 2, 3, 4, 5, 6],
}).with_columns(
    pl.col("timestamp").str.to_datetime()
)
```

┌─────────────────────┬───────┐
│ timestamp           ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2024-04-10 10:00:01 ┆ 1     │
│ 2024-04-10 10:01:30 ┆ 2     │
│ 2024-04-10 10:03:10 ┆ 3     │
│ 2024-04-10 10:05:00 ┆ 4     │
│ 2024-04-10 10:06:00 ┆ 5     │
│ 2024-04-10 10:08:10 ┆ 6     │
└─────────────────────┴───────┘

After applying a rolling window, I want to get the following:

```python
(
    df.rolling("timestamp", period="5m+????", closed="both")  # ??????
    .agg(
        pl.col("value"),
        pl.mean("value").alias("rolling_value"),
    )
)
```

┌─────────────────────┬─────────────────┬───────────────┐
│ timestamp           ┆ value           ┆ rolling_value │
│ ---                 ┆ ---             ┆ ---           │
│ datetime[μs]        ┆ list[i64]       ┆ f64           │
╞═════════════════════╪═════════════════╪═══════════════╡
│ 2024-04-10 10:00:01 ┆ [1]             ┆ 1.0           │
│ 2024-04-10 10:01:30 ┆ [1, 2]          ┆ 1.5           │
│ 2024-04-10 10:03:10 ┆ [1, 2, 3]       ┆ 2.0           │
│ 2024-04-10 10:05:00 ┆ [1, 2, 3, 4]    ┆ 2.5           │
│ 2024-04-10 10:06:00 ┆ [1, 2, 3, 4, 5] ┆ 3.0           │  <- includes the first value
│ 2024-04-10 10:08:10 ┆ [3, 4, 5, 6]    ┆ 4.5           │
└─────────────────────┴─────────────────┴───────────────┘

2 Answers


Do a rolling aggregation with a left-exclusive window, then a backwards join_asof to add back the extra value you want from outside the window:

```python
(
    df.rolling("timestamp", period="5m", closed="right")
    .agg("value")
    .join_asof(
        df,
        left_on="timestamp",
        right_on=pl.col.timestamp.dt.offset_by("5m"),
    )
    .select(
        pl.col.timestamp,
        pl.when(pl.col.value_right.is_null())
        .then(pl.col.value)
        .otherwise(pl.concat_list(pl.col.value_right, pl.col.value)),
    )
    .with_columns(rolling_value=pl.col.value.list.mean())
)
```

2 Comments

Thank you! But I have 20 million rows of data, and it crashes. A regular .rolling(...).agg(pl.mean(...)) works in less than a second. I understand this solution has too much overhead because of the lists, and the mean is not calculated incrementally. Anyway, it seems the only real solution is a way in Polars to define custom rolling windows, maybe like .rolling("timestamp", period="5m+1e") or .rolling(window_size=pl.col("backward_offset"), ...)
@MikeChurch You can adapt the solution to calculate a sum + length and use those to compute the mean rather than use .list.mean().

You can use a cross join:

```python
import polars as pl
from datetime import timedelta

df = pl.DataFrame({
    "timestamp": [
        "2024-04-10 10:00:01",
        "2024-04-10 10:01:30",
        "2024-04-10 10:03:10",
        "2024-04-10 10:05:00",
        "2024-04-10 10:06:00",
        "2024-04-10 10:08:10",
    ],
    "value": [1, 2, 3, 4, 5, 6],
}).with_columns(
    pl.col("timestamp").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S")
).sort("timestamp")

df_cross = df.join(df, how="cross")

windowed = df_cross.filter(
    (pl.col("timestamp_right") <= pl.col("timestamp"))
    & (pl.col("timestamp_right") >= pl.col("timestamp") - timedelta(minutes=5))
)

result = windowed.group_by("timestamp").agg([
    pl.col("value_right").alias("value_window"),
    pl.col("value_right").mean().alias("rolling_value"),
])

print(result)
```

Output:

shape: (6, 3)
┌─────────────────────┬──────────────┬───────────────┐
│ timestamp           ┆ value_window ┆ rolling_value │
│ ---                 ┆ ---          ┆ ---           │
│ datetime[μs]        ┆ list[i64]    ┆ f64           │
╞═════════════════════╪══════════════╪═══════════════╡
│ 2024-04-10 10:01:30 ┆ [1, 2]       ┆ 1.5           │
│ 2024-04-10 10:03:10 ┆ [1, 2, 3]    ┆ 2.0           │
│ 2024-04-10 10:08:10 ┆ [3, 4, … 6]  ┆ 4.5           │
│ 2024-04-10 10:00:01 ┆ [1]          ┆ 1.0           │
│ 2024-04-10 10:06:00 ┆ [2, 3, … 5]  ┆ 3.5           │
│ 2024-04-10 10:05:00 ┆ [1, 2, … 4]  ┆ 2.5           │
└─────────────────────┴──────────────┴───────────────┘

1 Comment

Unfortunately, such a solution has disastrous performance for large amounts of data.
