
Adding the SessionEncoder #1930

Open
rcap107 wants to merge 24 commits into skrub-data:main from rcap107:feat-session-encoder

Conversation

@rcap107 (Member) commented Feb 24, 2026

Very early draft of an encoder that adds session IDs and statistics to a given dataframe.

This one isn't a SingleColumnTransformer because it needs two separate columns.

To decide:

  • What's the granularity we want to go for in the session duration? minutes? seconds? let the user choose?
  • What statistics should be added?
    ...

Example:
Given this dataset (randomly generated):

                  timestamp  user_id     value
    0   2024-01-01 00:00:00      101 -1.072157
    1   2024-01-01 00:02:00      101 -2.281613
    2   2024-01-01 00:04:00      102  0.462635
    3   2024-01-01 00:06:00      101  1.929639
    4   2024-01-01 00:08:00      101  1.285300
    5   2024-01-11 00:00:00      101  1.279716
    6   2024-01-11 00:02:00      101  0.958734
    7   2024-01-11 00:04:00      102  0.041920
    8   2024-01-11 00:06:00      102  0.383161
    9   2024-01-11 00:08:00      101 -0.021126
    10  2024-01-21 00:00:00      101 -1.605621
    11  2024-01-21 00:02:00      102 -0.076007
    12  2024-01-21 00:04:00      101  1.609458
    13  2024-01-21 00:06:00      102 -0.137637
    14  2024-01-21 00:08:00      101 -0.317564

We should be able to fit-transform the SessionEncoder like so:

    se = SessionEncoder(
        add_duration=True, add_sessions_per_user=True, add_session_time=True
    )
    se.fit_transform(df)

and get something like this:

                  timestamp  user_id     value  session_id  session_duration  sessions_per_user  total_session_time
    0   2024-01-01 00:00:00      101 -1.072157           0                 4                 10            12000000
    1   2024-01-01 00:02:00      101 -2.281613           0                 4                 10            12000000
    2   2024-01-01 00:06:00      101  1.929639           0                 4                 10            12000000
    3   2024-01-01 00:08:00      101  1.285300           0                 4                 10            12000000
    4   2024-01-11 00:00:00      101  1.279716           1                 3                 10            12000000
    5   2024-01-11 00:02:00      101  0.958734           1                 3                 10            12000000
    6   2024-01-11 00:08:00      101 -0.021126           1                 3                 10            12000000
    7   2024-01-21 00:00:00      101 -1.605621           2                 3                 10            12000000
    8   2024-01-21 00:04:00      101  1.609458           2                 3                 10            12000000
    9   2024-01-21 00:08:00      101 -0.317564           2                 3                 10            12000000
    10  2024-01-01 00:04:00      102  0.462635           3                 1                  5             6000000
    11  2024-01-11 00:04:00      102  0.041920           4                 2                  5             6000000
    12  2024-01-11 00:06:00      102  0.383161           4                 2                  5             6000000
    13  2024-01-21 00:02:00      102 -0.076007           5                 2                  5             6000000
    14  2024-01-21 00:06:00      102 -0.137637           5                 2                  5             6000000
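The gap-based sessionization shown above can be sketched in a few lines of pandas (a hedged sketch, not the skrub implementation; the function name, the 30-minute default, and the column names are illustrative assumptions):

```python
import pandas as pd

def assign_session_ids(df, timestamp="timestamp", by="user_id", gap_minutes=30):
    """Give each row an integer session id: within one user's log, a new
    session starts whenever the gap between consecutive events exceeds
    `gap_minutes`; ids are globally unique across users. Illustrative only."""
    df = df.sort_values([by, timestamp])
    long_gap = df.groupby(by)[timestamp].diff() > pd.Timedelta(minutes=gap_minutes)
    # A session starts at each user's first row or after a long gap;
    # a cumulative sum over those flags yields a running session counter.
    new_session = long_gap | df[by].ne(df[by].shift())
    return df.assign(session_id=new_session.cumsum() - 1)
```

Per-session statistics like `session_duration` would then be group aggregations over `[by, "session_id"]`.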
@GaelVaroquaux (Member) commented Feb 24, 2026 via email

@rcap107 (Member, Author) commented Feb 25, 2026

I added an example with some possible parameters that we could add to the encoder.

I've been using this dataset as an example: https://www.kaggle.com/datasets/mylesoneill/warcraft-avatar-history?select=wowah_data.csv

I've already noticed a pretty big difference in performance between pandas and polars.

@rcap107 (Member, Author) commented Feb 25, 2026

I think the code can already be reviewed for early comments. I've added tests and doctests, and so far the basic approach for sessionization is working.

In the end I decided to avoid adding more features because those only make sense after aggregation, and that should be done by the user with the data ops.

Something that needs to be decided is the resolution of the session duration: for now, it's in minutes, with 30 minutes being the default value. We might want to change that to seconds, or add a parameter so the user can decide the resolution by themselves.

@rcap107 (Member, Author) commented Feb 25, 2026

I also just realized that the "by" column is not necessary: maybe the sessions are just a sequence of operations executed by the user, and we only care that there is a gap of a certain duration between actions to mark a new session.

I'll update the code to reflect that.

@jeromedockes (Member)

I also just realized that the "by" column is not necessary:

not sure I understand -- we need to know which column(s) identify users, to have one log per user, and then sessionize that: an action from user B does not prolong the session of user A

@rcap107 (Member, Author) commented Feb 25, 2026

I also just realized that the "by" column is not necessary:

not sure I understand -- we need to know which column(s) identify users, to have one log per user, and then sessionize that: an action from user B does not prolong the session of user A

I could have a dataset where I only have the timestamp, like a logfile that has only the events. Then, I may still want to group the events so that sessions are delimited by periods of activity. Like, the single user has been connected for this long, then they disconnected, then they connected again. If there is only a single user we care about, then there's no need to group by that.
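For that single-series case the grouping column can simply be skipped; a minimal sketch in pandas (illustrative function name and default, not the skrub API):

```python
import pandas as pd

def sessionize_timestamps(ts, gap=pd.Timedelta(minutes=30)):
    """Label each timestamp with a session number: after sorting, any gap
    longer than `gap` between consecutive events starts a new session."""
    ts = ts.sort_values()
    # diff() is NaT for the first event, which compares as False, so the
    # first session gets label 0; each long gap increments the counter.
    return (ts.diff() > gap).cumsum()
```

This is exactly the grouped version with the group-by step removed, which is why making `by` optional is cheap.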

@jeromedockes (Member)

sorry, I had misunderstood the "is not necessary" as "we can remove this parameter". yes I agree it can be optional :)

@rcap107 rcap107 marked this pull request as ready for review February 26, 2026 10:56
@rcap107 (Member, Author) commented Feb 26, 2026

The example is still missing, but the rest of the code is ready for review.

There are two points that still need to be discussed:

  • How do we define the resolution of the session gap? For the moment it's in minutes. Should I add a parameter to let the user decide the granularity?
  • What should I use as the name for the session id column? At the moment I just have "session_id", but it should probably be something like "TIMESERIES_NAME_session_id" instead.

I also need to see if the thing works when it's put in a pipeline or if I'm missing something.

@jeromedockes (Member)

How do we define the resolution of the session gap? for the moment it's in minutes. Should I add a parameter to let the user decide the granularity?

I would say by default any duration passed as a number is usually in seconds, and we can also allow strings like "2s", "2m", "2h"
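A possible shape for such a parser (a sketch under the convention suggested above; `parse_gap` is a hypothetical helper, not skrub API):

```python
import numbers
import pandas as pd

def parse_gap(gap):
    """Accept a bare number (interpreted as seconds) or a string such as
    '2s', '2m', '2h' and return a pandas Timedelta. Illustrative only."""
    if isinstance(gap, numbers.Number):
        return pd.Timedelta(seconds=gap)
    # Map the single-letter suffix to a Timedelta keyword argument.
    units = {"s": "seconds", "m": "minutes", "h": "hours"}
    value, unit = float(gap[:-1]), gap[-1]
    return pd.Timedelta(**{units[unit]: value})
```

Delegating to `pd.Timedelta` keeps the encoder's own surface small while still accepting both spellings.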

      user_id            timestamp  action  timestamp_session_id
    0   alice  2024-01-01 10:00:00   login                     0
    1   alice  2024-01-01 10:05:00    view                     0
    2   alice  2024-01-01 11:00:00  logout                     1
Member:

maybe pick something other than 'logout', because in this example it looks like the logout was in reality probably the last event of the first session

        if self.by is not None
        else [self.timestamp]
    )
    X_sorted = sbd.sort(X, by=sort_by)
Member:

maybe we could keep only the columns we need before sorting, so that X_sorted will be smaller in case X is large

Member (Author):

Yes that's a good idea
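A pandas sketch of that suggestion (illustrative only; skrub's `sbd` dataframe layer would express this differently):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [2, 1],
    "timestamp": pd.to_datetime(["2024-01-02", "2024-01-01"]),
    "payload": ["large blob a", "large blob b"],
})
# Sort only the columns the sessionizer needs; wide payload columns stay
# out of the sort and can be re-attached afterwards via the index.
needed = df[["user_id", "timestamp"]].sort_values(["user_id", "timestamp"])
```

For wide inputs this keeps the sorted intermediate small, which is the point of the review comment.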

    X_with_session_id = sbd.drop_columns(X_with_session_id, to_drop)

    self.all_outputs_ = sbd.column_names(X_with_session_id)
    return X_with_session_id
Member:

I don't see where we restore the original row order?

Member (Author):

we don't, I need to fix that
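One common pandas pattern for restoring the input order after a sort (a sketch, not the actual fix; the `_pos` helper column is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2024-01-02", "2024-01-01", "2024-01-03"])})
# Remember each row's original position before sorting...
work = df.assign(_pos=range(len(df))).sort_values("timestamp")
# ...compute the session features on the sorted frame here, then put the
# rows back in the caller's order and drop the helper column.
restored = work.sort_values("_pos").drop(columns="_pos")
```

A positional column is safer than relying on the index, which may be non-unique or non-integer in user data.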

@jeromedockes jeromedockes moved this to Done in Labs Feb 27, 2026
@jeromedockes jeromedockes moved this from Done to In progress in Labs Feb 27, 2026
@rcap107 rcap107 linked an issue Mar 12, 2026 that may be closed by this pull request

3 participants