
Adding the SessionEncoder #1930

Open
rcap107 wants to merge 24 commits into skrub-data:main from rcap107:feat-session-encoder

Conversation

@rcap107 (Member) commented Feb 24, 2026

Very early draft of an encoder that adds session IDs and statistics to a given dataframe.

This one isn't a SingleColumnTransformer because it needs two separate columns.

To decide:

  • What's the granularity we want to go for in the session duration? minutes? seconds? let the user choose?
  • What statistics should be added?
    ...

Example:
Given this dataset (randomly generated):

                  timestamp  user_id     value
    0   2024-01-01 00:00:00      101 -1.072157
    1   2024-01-01 00:02:00      101 -2.281613
    2   2024-01-01 00:04:00      102  0.462635
    3   2024-01-01 00:06:00      101  1.929639
    4   2024-01-01 00:08:00      101  1.285300
    5   2024-01-11 00:00:00      101  1.279716
    6   2024-01-11 00:02:00      101  0.958734
    7   2024-01-11 00:04:00      102  0.041920
    8   2024-01-11 00:06:00      102  0.383161
    9   2024-01-11 00:08:00      101 -0.021126
    10  2024-01-21 00:00:00      101 -1.605621
    11  2024-01-21 00:02:00      102 -0.076007
    12  2024-01-21 00:04:00      101  1.609458
    13  2024-01-21 00:06:00      102 -0.137637
    14  2024-01-21 00:08:00      101 -0.317564

We should be able to fit-transform the SessionEncoder like so:

    se = SessionEncoder(
        add_duration=True, add_sessions_per_user=True, add_session_time=True
    )
    se.fit_transform(df)

and get something like this:

                  timestamp  user_id     value  session_id  session_duration  sessions_per_user  total_session_time
    0   2024-01-01 00:00:00      101 -1.072157           0                 4                 10            12000000
    1   2024-01-01 00:02:00      101 -2.281613           0                 4                 10            12000000
    2   2024-01-01 00:06:00      101  1.929639           0                 4                 10            12000000
    3   2024-01-01 00:08:00      101  1.285300           0                 4                 10            12000000
    4   2024-01-11 00:00:00      101  1.279716           1                 3                 10            12000000
    5   2024-01-11 00:02:00      101  0.958734           1                 3                 10            12000000
    6   2024-01-11 00:08:00      101 -0.021126           1                 3                 10            12000000
    7   2024-01-21 00:00:00      101 -1.605621           2                 3                 10            12000000
    8   2024-01-21 00:04:00      101  1.609458           2                 3                 10            12000000
    9   2024-01-21 00:08:00      101 -0.317564           2                 3                 10            12000000
    10  2024-01-01 00:04:00      102  0.462635           3                 1                  5             6000000
    11  2024-01-11 00:04:00      102  0.041920           4                 2                  5             6000000
    12  2024-01-11 00:06:00      102  0.383161           4                 2                  5             6000000
    13  2024-01-21 00:02:00      102 -0.076007           5                 2                  5             6000000
    14  2024-01-21 00:06:00      102 -0.137637           5                 2                  5             6000000
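The gap-based sessionization shown above can be sketched in a few lines of pandas (a hedged sketch, not the skrub implementation; the function name, the 30-minute default, and the column names are illustrative assumptions):

```python
import pandas as pd

def assign_session_ids(df, timestamp="timestamp", by="user_id", gap_minutes=30):
    """Give each row an integer session id: within one user's log, a new
    session starts whenever the gap between consecutive events exceeds
    `gap_minutes`; ids are globally unique across users. Illustrative only."""
    df = df.sort_values([by, timestamp])
    long_gap = df.groupby(by)[timestamp].diff() > pd.Timedelta(minutes=gap_minutes)
    # A session starts at each user's first row or after a long gap;
    # a cumulative sum over those flags yields a running session counter.
    new_session = long_gap | df[by].ne(df[by].shift())
    return df.assign(session_id=new_session.cumsum() - 1)
```

Per-session statistics like `session_duration` would then be group aggregations over `[by, "session_id"]`.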
@GaelVaroquaux (Member) commented Feb 24, 2026 via email

@rcap107 (Member, Author) commented Feb 25, 2026

I added an example with some possible parameters that we could add to the encoder.

I've been using this dataset as an example: https://www.kaggle.com/datasets/mylesoneill/warcraft-avatar-history?select=wowah_data.csv

I've already noticed a pretty big difference in performance between pandas and polars.

@rcap107 (Member, Author) commented Feb 25, 2026

I think the code can already be reviewed for early comments. I've added tests and doctests, and so far the basic approach for sessionization is working.

In the end I decided to avoid adding more features because those only make sense after aggregation, and that should be done by the user with the data ops.

Something that needs to be decided is the resolution of the session duration: for now, it's in minutes, with 30 minutes being the default value. We might want to change that to seconds, or add a parameter so the user can decide the resolution by themselves.

@rcap107 (Member, Author) commented Feb 25, 2026

I also just realized that the "by" column is not necessary: maybe the sessions are just a sequence of operations executed by the user, and we only care that there is a gap of a certain duration between actions to mark a new session.

I'll update the code to reflect that.

@jeromedockes (Member)

I also just realized that the "by" column is not necessary:

not sure I understand -- we need to know which column(s) identify users, to have one log per user, and then sessionize that: an action from user B does not prolong the session of user A

@rcap107 (Member, Author) commented Feb 25, 2026

I also just realized that the "by" column is not necessary:

not sure I understand -- we need to know which column(s) identify users, to have one log per user, and then sessionize that: an action from user B does not prolong the session of user A

I could have a dataset where I only have the timestamp, like a logfile that has only the events. Then, I may still want to group the events so that sessions are delimited by periods of activity. Like, the single user has been connected for this long, then they disconnected, then they connected again. If there is only a single user we care about, then there's no need to group by that.
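For that single-series case the grouping column can simply be skipped; a minimal sketch in pandas (illustrative function name and default, not the skrub API):

```python
import pandas as pd

def sessionize_timestamps(ts, gap=pd.Timedelta(minutes=30)):
    """Label each timestamp with a session number: after sorting, any gap
    longer than `gap` between consecutive events starts a new session."""
    ts = ts.sort_values()
    # diff() is NaT for the first event, which compares as False, so the
    # first session gets label 0; each long gap increments the counter.
    return (ts.diff() > gap).cumsum()
```

This is exactly the grouped version with the group-by step removed, which is why making `by` optional is cheap.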

@jeromedockes (Member)

sorry, I had misunderstood the "is not necessary" as "we can remove this parameter". yes I agree it can be optional :)

@rcap107 rcap107 marked this pull request as ready for review February 26, 2026 10:56
@rcap107 (Member, Author) commented Feb 26, 2026

The example is still missing, but the rest of the code is ready for review.

There are two points that still need to be discussed:

  • How do we define the resolution of the session gap? For the moment it's in minutes. Should I add a parameter to let the user decide the granularity?
  • What should I use as the name for the session id column? At the moment I just have "session_id", but it should probably be something like "TIMESERIES_NAME_session_id" instead.

I also need to see if the thing works when it's put in a pipeline or if I'm missing something.

@jeromedockes (Member)

How do we define the resolution of the session gap? for the moment it's in minutes. Should I add a parameter to let the user decide the granularity?

I would say by default any duration passed as a number is usually in seconds, and we can also allow strings like "2s", "2m", "2h"
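A possible shape for such a parser (a sketch under the convention suggested above; `parse_gap` is a hypothetical helper, not skrub API):

```python
import numbers
import pandas as pd

def parse_gap(gap):
    """Accept a bare number (interpreted as seconds) or a string such as
    '2s', '2m', '2h' and return a pandas Timedelta. Illustrative only."""
    if isinstance(gap, numbers.Number):
        return pd.Timedelta(seconds=gap)
    # Map the single-letter suffix to a Timedelta keyword argument.
    units = {"s": "seconds", "m": "minutes", "h": "hours"}
    value, unit = float(gap[:-1]), gap[-1]
    return pd.Timedelta(**{units[unit]: value})
```

Delegating to `pd.Timedelta` keeps the encoder's own surface small while still accepting both spellings.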

      user_id            timestamp  action  timestamp_session_id
    0   alice  2024-01-01 10:00:00   login                     0
    1   alice  2024-01-01 10:05:00    view                     0
    2   alice  2024-01-01 11:00:00  logout                     1
Member:

maybe pick something other than 'logout', because in this example it looks like the logout was in reality probably the last event of the first session

        if self.by is not None
        else [self.timestamp]
    )
    X_sorted = sbd.sort(X, by=sort_by)
Member:

maybe we could keep only the columns we need before sorting, so that X_sorted will be smaller in case X is large

Member (Author):

Yes that's a good idea
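A pandas sketch of that suggestion (illustrative only; skrub's `sbd` dataframe layer would express this differently):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [2, 1],
    "timestamp": pd.to_datetime(["2024-01-02", "2024-01-01"]),
    "payload": ["large blob a", "large blob b"],
})
# Sort only the columns the sessionizer needs; wide payload columns stay
# out of the sort and can be re-attached afterwards via the index.
needed = df[["user_id", "timestamp"]].sort_values(["user_id", "timestamp"])
```

For wide inputs this keeps the sorted intermediate small, which is the point of the review comment.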

    X_with_session_id = sbd.drop_columns(X_with_session_id, to_drop)

    self.all_outputs_ = sbd.column_names(X_with_session_id)
    return X_with_session_id
Member:

I don't see where we restore the original row order?

Member (Author):

we don't, I need to fix that
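One common pandas pattern for restoring the input order after a sort (a sketch, not the actual fix; the `_pos` helper column is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2024-01-02", "2024-01-01", "2024-01-03"])})
# Remember each row's original position before sorting...
work = df.assign(_pos=range(len(df))).sort_values("timestamp")
# ...compute the session features on the sorted frame here, then put the
# rows back in the caller's order and drop the helper column.
restored = work.sort_values("_pos").drop(columns="_pos")
```

A positional column is safer than relying on the index, which may be non-unique or non-integer in user data.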

@jeromedockes jeromedockes moved this to Done in Labs Feb 27, 2026
@jeromedockes jeromedockes moved this from Done to In progress in Labs Feb 27, 2026
@rcap107 rcap107 linked an issue Mar 12, 2026 that may be closed by this pull request

3 participants