Conversation
Can you do a small example (even one that does not work currently) to showcase how you are thinking of using such an object? Thanks!!
I added an example with some possible parameters that we could add to the encoder. I've been using this dataset as an example: https://www.kaggle.com/datasets/mylesoneill/warcraft-avatar-history?select=wowah_data.csv I've already noticed a pretty big difference in performance between pandas and polars.
I think the code can already be reviewed for early comments. I've added tests and doctests, and so far the basic approach for sessionization is working. In the end I decided to avoid adding more features because those only make sense after aggregation, and that should be done by the user with the data ops. Something that needs to be decided is the resolution of the session duration: for now, it's in minutes, with 30 minutes being the default value. We might want to change that to seconds, or add a parameter so the user can decide the resolution by themselves.
I also just realized that the "by" column is not necessary: maybe the sessions are just a sequence of operations executed by the user, and we only care that there is a gap of a certain duration between actions to mark a new session. I'll update the code to reflect that.
not sure I understand -- we need to know which column(s) identify users, to have one log per user, and then sessionize that: an action from user B does not prolong the session of user A |
I could have a dataset where I only have the timestamp, like a logfile that has only the events. Then, I may still want to group the events so that sessions are delimited by periods of activity. Like, the single user has been connected for this long, then they disconnected, then they connected again. If there is only a single user we care about, then there's no need to group by that. |
sorry, I had misunderstood the "is not necessary" as "we can remove this parameter". yes I agree it can be optional :)
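The single-user case described above (a logfile with only timestamps, no user column) can be sketched with plain pandas. This is a toy illustration of the gap-based idea only: the names `max_gap` and `session_id` are assumptions, not the final API, and the real encoder would go through skrub's dataframe-agnostic backend.

```python
import pandas as pd

# Ungrouped sessionization sketch: a gap larger than `max_gap`
# between consecutive events starts a new session.
ts = pd.Series(pd.to_datetime([
    "2024-01-01 10:00", "2024-01-01 10:05",  # first burst of activity
    "2024-01-01 11:00", "2024-01-01 11:10",  # 55 min gap -> new session
]))
max_gap = pd.Timedelta("30min")

# diff() is NaT on the first row, and NaT > max_gap is False,
# so the first event lands in session 0 as expected.
session_id = (ts.diff() > max_gap).cumsum()
print(session_id.tolist())  # [0, 0, 1, 1]
```

With a `by` column, the same comparison would simply be computed per group, with the first event of each group always opening a new session.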
The example is still missing, but the rest of the code is ready for review. There are two points that still need to be discussed:
I also need to see if the thing works when it's put in a pipeline or if I'm missing something. |
I would say any duration passed as a bare number should be interpreted as seconds by default, and we can also allow strings like "2s", "2m", "2h"
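The convention above could be handled by a small parsing helper. This is a hypothetical sketch (the helper name, the regex, and the accepted unit set are all assumptions; in practice one might instead delegate string parsing to `pandas.Timedelta` or the equivalent polars utility):

```python
import re

# Hypothetical helper: accept a bare number (interpreted as seconds)
# or a string like "2s" / "2m" / "2h", and return seconds as a float.
_UNITS = {"s": 1, "m": 60, "h": 3600}

def parse_duration(value):
    if isinstance(value, (int, float)):
        return float(value)  # bare numbers mean seconds
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smh])", value)
    if match is None:
        raise ValueError(f"Cannot parse duration: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit]
```

For example, `parse_duration(30)`, `parse_duration("2m")`, and `parse_duration("2h")` would give 30.0, 120.0, and 7200.0 seconds respectively.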
skrub/_session_encoder.py (outdated)
  user_id            timestamp  action  timestamp_session_id
0   alice  2024-01-01 10:00:00   login                     0
1   alice  2024-01-01 10:05:00    view                     0
2   alice  2024-01-01 11:00:00  logout                     1
maybe pick something other than 'logout': in this example it looks like the logout was probably the last event of the first session
    if self.by is not None
    else [self.timestamp]
)
X_sorted = sbd.sort(X, by=sort_by)
maybe we could keep only the columns we need before sorting, so that X_sorted will be smaller in case X is large
X_with_session_id = sbd.drop_columns(X_with_session_id, to_drop)

self.all_outputs_ = sbd.column_names(X_with_session_id)
return X_with_session_id
I don't see where we restore the original row order?
we don't, I need to fix that
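One common way to fix this is to rely on the dataframe's original index: do the work on the sorted rows, then re-sort by index at the end. A minimal sketch, assuming a pandas backend (the actual code goes through skrub's dataframe-agnostic helpers, so the equivalent there may differ):

```python
import pandas as pd

# Work on a sorted view, then restore the original row order.
df = pd.DataFrame({"t": [3, 1, 2]})
sorted_df = df.sort_values("t")
sorted_df["order_seen"] = range(len(sorted_df))  # work done on sorted rows
restored = sorted_df.sort_index()  # original row order is back
print(restored["t"].tolist())  # [3, 1, 2]
```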
Very early draft of an encoder that adds session IDs and statistics to a given dataframe. This one isn't a SingleColumnTransformer because it needs two separate columns.
To decide:
...
Example:
Given this dataset (randomly generated)
We should be able to fit transform the SessionEncoder like so:
and get something like this
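Since the example itself is still missing from the description, here is a rough, self-contained sketch of what the intended usage could look like. Everything below is pieced together from this thread as an assumption: the class name, the `timestamp` / `by` / `max_gap` parameters, and the `<timestamp>_session_id` output column are guesses, and the real implementation works through skrub's dataframe-agnostic `sbd` helpers rather than pandas directly.

```python
import pandas as pd

class SessionEncoderSketch:
    """Toy stand-in for the proposed SessionEncoder (API is a guess).

    Adds a ``<timestamp>_session_id`` column: the id increments whenever
    the gap between consecutive events (per ``by`` group, if given)
    exceeds ``max_gap``.
    """

    def __init__(self, timestamp, by=None, max_gap="30min"):
        self.timestamp = timestamp
        self.by = by
        self.max_gap = max_gap

    def fit_transform(self, X):
        order = (list(self.by) if self.by else []) + [self.timestamp]
        out = X.sort_values(order)
        # A gap larger than max_gap starts a new session.
        new = out[self.timestamp].diff() > pd.Timedelta(self.max_gap)
        new.iloc[0] = True  # the very first event always opens a session
        if self.by:
            # The first event of each group also opens a new session.
            new |= out[list(self.by)].ne(out[list(self.by)].shift()).any(axis=1)
        out[f"{self.timestamp}_session_id"] = new.cumsum() - 1
        return out.sort_index()  # restore the original row order

df = pd.DataFrame({
    "user_id": ["alice", "alice", "alice"],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00:00", "2024-01-01 10:05:00", "2024-01-01 11:00:00",
    ]),
    "action": ["login", "view", "logout"],
})
encoder = SessionEncoderSketch(timestamp="timestamp", by=["user_id"])
out = encoder.fit_transform(df)
print(out["timestamp_session_id"].tolist())  # [0, 0, 1]
```

On the small table from the review above, the 55-minute gap before the third event starts a new session, reproducing the `0, 0, 1` session ids shown there.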