I have a dataset of web browsing histories for users visiting a particular website over a period of time (say the last 90 days). Each user has a unique ID and several records showing when he/she visited a particular page on the website.
It looks like the following:
UserID,Timestamp,Path
U_1,2017-01-24 12:05:43,/sport/rugby/article_title
U_1,2017-01-24 12:06:56,/sport/football/article_title
U_1,2017-01-24 15:26:12,/finance/local/article_title
......
I do not have access to the content of the articles, I just know the path to the article.
My goal is to build a classifier to predict whether a user will take an action or not, so I need to extract features from each user's data.
Suppose that I have ground-truth information associated with each user, indicating when a user did the action.
My first guess is to aggregate all the records of each user and extract frequency features (hashing TF) from each level of the path.
For example, a particular user might visit the /sport category 5 times (first-level category), the /sport/football category 3 times and the /sport/rugby category 2 times (both second-level categories).
So for each user I will have a feature vector representing the frequency of the first level categories, and another one for the second level categories and so on.
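A minimal sketch of the per-level hashed frequency features described above, using only the Python standard library. The bucket count, the MD5-based hash, and the function names (`bucket`, `level_prefixes`, `user_level_features`) are assumptions for illustration, not part of the dataset or of any particular library.

```python
import hashlib

N_BUCKETS = 16  # assumed size of each hashed vector; tune for the real data


def bucket(token, n_buckets=N_BUCKETS):
    """Map a string to a stable bucket index (the hashing trick)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def level_prefixes(path, max_levels=2):
    """'/sport/rugby/article_title' -> ['/sport', '/sport/rugby']."""
    parts = [p for p in path.split("/") if p]
    depth = min(max_levels, len(parts))
    return ["/" + "/".join(parts[:i]) for i in range(1, depth + 1)]


def user_level_features(records, max_levels=2, n_buckets=N_BUCKETS):
    """records: iterable of (user_id, timestamp, path).

    Returns {user_id: [level_1_vector, level_2_vector, ...]}, where each
    vector holds hashed visit counts for that level of the path.
    """
    features = {}
    for user_id, _timestamp, path in records:
        vectors = features.setdefault(
            user_id, [[0] * n_buckets for _ in range(max_levels)]
        )
        for level, prefix in enumerate(level_prefixes(path, max_levels)):
            vectors[level][bucket(prefix, n_buckets)] += 1
    return features
```

With a small number of buckets, two different categories can collide in the same slot, so the bucket count trades memory against collisions.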
I can now train a classifier for each feature vector and do a late fusion of the results, or I can concatenate (early fusion) the different features and train a single classifier.
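The difference between the two fusion options can be sketched as follows. This only shows the data flow: the per-level classifiers are stood in for by hypothetical scoring callables that return a probability of the action, and the weighted average used for late fusion is one common choice among several.

```python
def early_fusion(level_vectors):
    """Concatenate the per-level vectors into one input for a single classifier."""
    fused = []
    for vector in level_vectors:
        fused.extend(vector)
    return fused


def late_fusion(level_vectors, scorers, weights=None):
    """Score each level with its own model, then average the probabilities.

    scorers: one callable per level (hypothetical stand-ins for trained
    classifiers), each mapping a feature vector to P(action).
    """
    if len(level_vectors) != len(scorers):
        raise ValueError("need exactly one scorer per feature vector")
    if weights is None:
        weights = [1.0] * len(scorers)
    total = sum(weights)
    scores = [score(vec) for score, vec in zip(scorers, level_vectors)]
    return sum(w * s for w, s in zip(weights, scores)) / total
```

Early fusion lets one model learn interactions between levels, while late fusion keeps each level's model small and makes it easy to weight levels differently.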
I can also extract the terms from the article titles in the paths and build a TF-IDF feature.
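A rough sketch of the title-term idea, assuming the last path segment is a title slug whose words are separated by underscores or hyphens (the real slugs may differ). Each user's visited titles are treated as one document, and the smoothed IDF formula below is just one common variant. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def title_terms(path):
    """'/sport/rugby/article_title' -> ['article', 'title']."""
    slug = path.rstrip("/").rsplit("/", 1)[-1]
    return [t for t in re.split(r"[^a-z0-9]+", slug.lower()) if t]


def tfidf_per_user(user_paths):
    """user_paths: {user_id: [path, ...]} -> {user_id: {term: weight}}."""
    term_counts = {
        user: Counter(t for path in paths for t in title_terms(path))
        for user, paths in user_paths.items()
    }
    n_users = len(term_counts)

    # Document frequency: in how many users' histories each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    # Smoothed inverse document frequency (an assumed variant).
    idf = {t: math.log((1 + n_users) / (1 + d)) + 1.0 for t, d in doc_freq.items()}

    return {
        user: {t: c * idf[t] for t, c in counts.items()}
        for user, counts in term_counts.items()
    }
```

The resulting sparse dictionaries can be hashed or mapped to a fixed vocabulary before being passed to a classifier.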
What I am trying now is to extract the features from the N days preceding the action for the positive samples, and from N randomly selected consecutive days for the negative users.
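The windowing step might look like the sketch below. It assumes timestamps are already parsed into `datetime` objects and that the observation period boundaries are known. The function names are hypothetical. Records at or after the action time are excluded so that the features cannot leak the label.

```python
import random
from datetime import timedelta


def window_before_action(records, action_time, n_days):
    """Keep (timestamp, path) records from the n_days before the action."""
    start = action_time - timedelta(days=n_days)
    return [(ts, path) for ts, path in records if start <= ts < action_time]


def random_window(records, period_start, period_end, n_days, rng=random):
    """Keep records from a randomly placed window of n_days consecutive days."""
    slack_days = (period_end - period_start).days - n_days
    if slack_days < 0:
        raise ValueError("window is longer than the observation period")
    start = period_start + timedelta(days=rng.randint(0, slack_days))
    end = start + timedelta(days=n_days)
    return [(ts, path) for ts, path in records if start <= ts < end]
```

Passing a seeded `random.Random` instance as `rng` makes the negative sampling reproducible across runs.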
What other features could I extract, and are there any better ML techniques for modelling and learning users' web browsing behaviour?