Feature extraction from web browsing history of one website

Question

I have a dataset of web browsing histories for users visiting a particular website over a period of time (say the last 90 days). Each user has a unique ID and several records showing when he/she visited a particular page on the website.

It looks as follows:

UserID,Timestamp,Path
U_1,2017-01-24 12:05:43,/sport/rugby/article_title
U_1,2017-01-24 12:06:56,/sport/football/article_title
U_1,2017-01-24 15:26:12,/finance/local/article_title
......

I do not have access to the content of the articles; I only know the path to each article.

My goal is to build a classifier to predict whether a user will take an action or not. So I need to extract features from each user's data.

Suppose that I have ground-truth information associated with each user, indicating when a user did the action.

My first guess is to aggregate all the records of each user and extract frequency features (hashing TF) from each level of the path.

For example, a particular user might visit the /sport category 5 times (first-level category), the /sport/football category 3 times (second-level category), and the /sport/rugby category 2 times (second-level category).

So for each user I will have a feature vector representing the frequency of the first level categories, and another one for the second level categories and so on.
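To make the per-level counting concrete, here is a minimal sketch using only the Python standard library. The helper names (`path_prefixes`, `level_counts`) and the sample records are illustrative, and it assumes the last path segment is always the article title. It produces raw counts rather than hashed ones.

```python
from collections import Counter, defaultdict

# Sample records in the (UserID, Timestamp, Path) layout from the question.
records = [
    ("U_1", "2017-01-24 12:05:43", "/sport/rugby/article_title"),
    ("U_1", "2017-01-24 12:06:56", "/sport/football/article_title"),
    ("U_1", "2017-01-24 15:26:12", "/finance/local/article_title"),
]

def path_prefixes(path, max_depth=2):
    """Return category prefixes of a path, e.g. ['/sport', '/sport/rugby']."""
    parts = [p for p in path.split("/") if p]
    categories = parts[:-1]  # assumption: the last segment is the article title
    depth_limit = min(max_depth, len(categories))
    return ["/" + "/".join(categories[:d]) for d in range(1, depth_limit + 1)]

def level_counts(records, max_depth=2):
    """Map user -> path level -> Counter of category prefixes visited."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for user, _timestamp, path in records:
        for depth, prefix in enumerate(path_prefixes(path, max_depth), start=1):
            counts[user][depth][prefix] += 1
    return counts

features = level_counts(records)
```

Each `Counter` can then be turned into a fixed-length vector, either with a fixed vocabulary of categories or with a hashing step.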

I can now train a classifier for each feature vector and do a late fusion of the results, or I can concatenate (early fusion) the different features and train a single classifier.

I can also extract the terms from the article titles and build a TF-IDF feature.

What I am trying now is to extract the features from the N days preceding the action for the positive samples, and from N randomly selected consecutive days for the negative users.

What other features can I extract, and are there any better ML techniques for modelling and learning users' web browsing behaviour?
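A rough sketch of this sampling scheme, under some assumptions: visits are already parsed into `(datetime, path)` pairs, the positive window ends strictly before the action so that nothing at or after the action leaks into the features, and the function names and sample data are made up for illustration.

```python
import random
from datetime import datetime, timedelta

def window_before_action(visits, action_time, n_days):
    """Positive sample: visits in the n_days strictly before the action."""
    start = action_time - timedelta(days=n_days)
    return [(ts, path) for ts, path in visits if start <= ts < action_time]

def random_window(visits, period_start, period_end, n_days, rng):
    """Negative sample: visits in a randomly placed window of n_days."""
    slack_days = (period_end - period_start).days - n_days
    offset = rng.randint(0, max(slack_days, 0))  # inclusive on both ends
    start = period_start + timedelta(days=offset)
    end = start + timedelta(days=n_days)
    return [(ts, path) for ts, path in visits if start <= ts < end]

# Illustrative data only.
visits = [
    (datetime(2017, 1, 10, 9, 0), "/sport/rugby/article_title"),
    (datetime(2017, 1, 20, 9, 0), "/sport/football/article_title"),
    (datetime(2017, 1, 24, 12, 5), "/finance/local/article_title"),
]
positive = window_before_action(visits, datetime(2017, 1, 25), n_days=7)
negative = random_window(
    visits, datetime(2016, 11, 1), datetime(2017, 1, 30), 7, random.Random(0)
)
```

Passing the random generator in explicitly keeps the negative windows reproducible between runs.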

geompalik · Accepted Answer · 2017-01-24 18:32:38Z

What you describe uses the data that the paths can offer. You can also easily generate features from the date and time. For instance, given the date, you can generate a categorical variable denoting the weekday (Monday, Tuesday, etc.). Given the timestamp, you can generate binary variables that split the day into four or more partitions: is_morning, is_afternoon, etc. Somebody may only read in the morning or at night, and the aim of these features is to capture this.
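As an illustration, the weekday and day-partition features could be derived like this. The partition boundaries (6, 12 and 18 o'clock) and the names `day_part` and `time_features` are arbitrary choices for the sketch, not something fixed by the problem.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
DAY_PARTS = ["morning", "afternoon", "evening", "night"]

def day_part(hour):
    """Assign an hour (0-23) to one of four assumed day partitions."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"

def time_features(timestamp):
    """Build weekday and is_<part> features from 'YYYY-MM-DD HH:MM:SS'."""
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    part = day_part(ts.hour)
    feats = {"weekday": WEEKDAYS[ts.weekday()]}  # weekday() is 0 for Monday
    for name in DAY_PARTS:
        feats["is_" + name] = int(part == name)
    return feats

example = time_features("2017-01-24 12:05:43")
```

To get per-user features, these per-visit values would then be aggregated, for example as the fraction of a user's visits that fall in each partition.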

Further, you can add the interactions between weekdays and day partitions. Such features may help to distinguish users who read about sports on Sunday mornings but read financial news on Monday mornings, when they are at work. Be careful of overfitting, though. Note that tree-based models can capture such interactions on their own; providing them explicitly is still beneficial, though.
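One possible way to encode such interactions is to count, per user, how often each (weekday, day partition, top-level category) combination occurs. The sketch below assumes a simple morning/afternoon/evening split and the key format `Weekday_part_category`; both are illustrative choices.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def interaction_key(timestamp, path):
    """Combine weekday, day partition and top-level category into one key."""
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    # Assumed three-way split of the day; any partitioning would do.
    if ts.hour < 12:
        part = "morning"
    elif ts.hour < 18:
        part = "afternoon"
    else:
        part = "evening"
    category = path.strip("/").split("/")[0]
    return f"{WEEKDAYS[ts.weekday()]}_{part}_{category}"

# Illustrative visits of one user.
user_visits = [
    ("2017-01-22 09:10:00", "/sport/rugby/article_title"),
    ("2017-01-22 10:45:00", "/sport/football/article_title"),
    ("2017-01-23 08:30:00", "/finance/local/article_title"),
]
interaction_counts = Counter(interaction_key(ts, p) for ts, p in user_visits)
```

The number of possible keys grows multiplicatively with each factor, which is where the overfitting risk comes from; restricting to top-level categories keeps it manageable.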

Thanks! I would just like your opinion on the training. You are proposing trees, which are known to be expensive to train. Does that mean I should concatenate all features into one vector and train the trees on it? Or would you favour per-feature classifiers followed by late fusion of the scores?
– Rami, Commented Jan 24, 2017 at 19:12
No, I am in favor of one classifier over the complete set of features. This allows the model to combine information, as discussed in the answer.
– geompalik, Commented Jan 25, 2017 at 16:05

Keith · Accepted Answer · 2018-07-03 16:23:29Z

This seems similar to a problem I am working on here. You do not state explicitly what sort of action you are trying to predict, but I think this is essentially the same problem. One thing I have found useful for feature preparation is to use several look-back windows of different lengths. In your case, that would be the number of category visits in the last day, week, and month. If you are looking for a change in behavior as a trigger, this will help you pick it up. The window lengths you use depend on the specifics of the problem, but you should have at least one window long enough to capture the state before the behavior that triggers the action.
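A minimal sketch of the look-back windows, assuming visits are `(datetime, path)` pairs and that features are computed relative to some reference time (for example, just before the action). The window lengths of 1, 7 and 30 days and the feature naming are only examples.

```python
from collections import Counter
from datetime import datetime, timedelta

def lookback_counts(visits, reference_time, windows_days=(1, 7, 30)):
    """Count top-level category visits within each look-back window."""
    features = Counter()
    for ts, path in visits:
        age = reference_time - ts
        if age < timedelta(0):
            continue  # ignore visits after the reference time
        category = path.strip("/").split("/")[0]
        for days in windows_days:
            if age <= timedelta(days=days):
                features[f"{category}_last_{days}d"] += 1
    return features

# Illustrative visits of one user.
visits = [
    (datetime(2017, 1, 1, 9, 0), "/finance/local/article_title"),
    (datetime(2017, 1, 20, 9, 0), "/sport/football/article_title"),
    (datetime(2017, 1, 24, 12, 5), "/sport/rugby/article_title"),
]
window_features = lookback_counts(visits, datetime(2017, 1, 25))
```

Comparing the short windows against the long one (for instance the ratio of last-week to last-month counts) is one way to expose a change in behavior to the model.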

I agree with the comments in the answer by geompalik. The features he describes will help with seasonality. You should use one classifier; boosted decision trees should be a good first try.
