Should I add string feature columns?

Question

If my dataframe looks like this:

user  item  property_1  property_2  property_3  rating
u1    i1    90.2        0           NaN         0
u1    i2    80.2        1           0.90        1
u1    i3    70.2        1           NaN         1
u2    i2    80.2        1           0.90        0
u2    i4    80.4        0           0.10        1
u3    i1    90.2        0           NaN         1
u3    i4    80.4        0           0.10        1
u3    i5    93.9        1           0.33        0
u3    i6    90.9        0           0.55        0
u4    i1    90.2        0           NaN         0
u4    i6    90.9        0           0.55        1
u4    i7    50.2        1           NaN         1

I want to predict the rating a user would give to an item using these properties. What method should I apply? I am looking for something that takes the user-item pairs into account.

I used XGBoost for classification with property_1, property_2, property_3 as features and obtained good results, but my model doesn't know that several users rated the same item, does it? Users and items appear multiple times, even though there are no duplicate rows. For example, the second and fourth rows have the same properties but different ratings, because the users are different:

user  item  property_1  property_2  property_3  rating
u1    i2    80.2        1           0.90        1
u2    i2    80.2        1           0.90        0

I already have a separate collaborative filtering model that works well, but it doesn't look at the properties of the item, which I want to use. And if I add item as a feature column, I get the error:

ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When categorical type is supplied, DMatrix parameter `enable_categorical` must be set to `True`.item

Peter · Accepted Answer · 2021-11-29 15:36:08Z

Regarding feature encoding

As the error says, strings are not accepted. You need to transform item in a way that can be digested by xgboost (essentially some numerical representation).

There seems to be a way to pass categorical columns when generating the DMatrix (see the docs; I have never tried it).
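A minimal sketch of the categorical route, assuming a recent xgboost release in which `enable_categorical` is supported for the chosen tree method. The toy dataframe copies a few rows from the question. Only the pandas conversion is executed here; the xgboost call is shown as a comment because it is untested in this answer.

```python
import pandas as pd

# Toy slice of the question's dataframe (values copied from the question).
df = pd.DataFrame({
    "item": ["i1", "i2", "i3", "i2"],
    "property_1": [90.2, 80.2, 70.2, 80.2],
    "property_2": [0, 1, 1, 1],
    "property_3": [float("nan"), 0.90, float("nan"), 0.90],
    "rating": [0, 1, 1, 0],
})

# Convert the string column to pandas' categorical dtype, which is
# what the error message asks for instead of plain strings.
df["item"] = df["item"].astype("category")

X = df.drop(columns="rating")
y = df["rating"]
print(X.dtypes)

# With xgboost installed, the DMatrix would then be built roughly as:
#   import xgboost as xgb
#   dtrain = xgb.DMatrix(X, label=y, enable_categorical=True)
```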

However, my first idea would be to "one hot" encode item using sklearn.preprocessing.OneHotEncoder. Another method would be to use pandas.get_dummies.
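Both encodings can be sketched as follows. The toy dataframe is a small slice of the question's data; the variable names (`dense`, `ids`, `X`) are made up for illustration. `pandas.get_dummies` produces a dense dataframe, while `OneHotEncoder` returns a scipy sparse matrix by default, which may be easier on memory when there are many users and items.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy slice of the question's dataframe.
df = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2"],
    "item": ["i1", "i2", "i2", "i4"],
    "property_1": [90.2, 80.2, 80.2, 80.4],
    "property_2": [0, 1, 1, 0],
    "rating": [0, 1, 0, 1],
})

# Option 1: pandas.get_dummies gives a dense dataframe with one
# indicator column per distinct user and per distinct item.
dense = pd.get_dummies(df, columns=["user", "item"])
print(dense.columns.tolist())

# Option 2: OneHotEncoder gives a scipy sparse matrix by default,
# so the many zeros are not stored explicitly.
encoder = OneHotEncoder(handle_unknown="ignore")
ids = encoder.fit_transform(df[["user", "item"]])

# Stack the one-hot identity columns next to the numeric properties.
numeric = sparse.csr_matrix(df[["property_1", "property_2"]].to_numpy(dtype=float))
X = sparse.hstack([ids, numeric], format="csr")
y = df["rating"].to_numpy()
print(X.shape)
```

The sparse matrix `X` can be passed to estimators that accept scipy sparse input; whether this is fast enough depends on the number of users and items.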

Update: Regarding the general model setup

Your model is not well described in the question. My understanding is that your model looks like:

$$\text{rating}(\text{user}, \text{item}, \dots),$$

so you aim at predicting the rating for a given user and item etc.

Each user and each item can be thought of as having its own "identity" (a happy or grumpy person; a quality or cheap product, etc.). This is called a "fixed effect" in econometrics. The canonical approach is to add one dummy (one-hot) variable for each user and each item. In linear models, this introduces a "level shift" (an additional intercept term). So a "cheap" product would get a lower rating on average, and a grumpy person would generally give a lower score (lower intercept) for a given product than other people would.

In tree-based models the effect is less clear. However, given the large literature that uses "fixed effects", I suppose this is a good starting point. You need to distinguish the sources of ratings as well as possible; individual product and user aspects are important and can easily be represented by a "dummy".
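As a sketch, the level shift in a linear model can be written as follows. The symbols $\alpha_u$, $\gamma_i$, $x_i$, $\beta$ and $\varepsilon_{ui}$ are notation introduced here for illustration; they do not appear in the question.

$$\text{rating}_{ui} = \alpha_u + \gamma_i + x_i^\top \beta + \varepsilon_{ui}$$

Here $\alpha_u$ is the intercept of user $u$ (lower for a grumpy person), $\gamma_i$ is the intercept of item $i$ (lower for a cheap product), $x_i$ collects the item properties (property_1, property_2, property_3), and $\varepsilon_{ui}$ is the error term. The user and item dummies are what allow $\alpha_u$ and $\gamma_i$ to be estimated.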

One open question is how to deal with "out of sample" users. I might think that there are no out of sample items, but I guess that there are out of sample users. If true, you would need to "approximate" the user's identity by socio-economic variables (age, gender, education, preferences, etc.).
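To illustrate what happens to an out-of-sample user under one-hot encoding, here is a small sketch using `OneHotEncoder` with `handle_unknown="ignore"`. The user `u9` is hypothetical and stands for someone not seen during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Users seen when the encoder is fitted.
train = pd.DataFrame({"user": ["u1", "u2", "u3"]})

# One known user and one hypothetical user (u9) that was never seen.
new = pd.DataFrame({"user": ["u3", "u9"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
encoded = encoder.transform(new).toarray()
print(encoded)
```

The unseen user gets no active dummy, so the model has no user-specific information for that row and can only rely on the remaining features.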

Thank you so much! I will look into it and solve the problem, I already accepted the answer. I just have one more question, if you could help me. Do you think my approach is correct, i.e. making "items" also a feature? (Should "users" also be considered features?) Will this basically answer the question "How would this user rate that item?". This is like my biggest concern right now and I would be very grateful if you could clear this up for me haha.
– futuredataengineer, Commented Nov 29, 2021 at 14:31

thank you, I used "Pandas get dummies" for users and item and ended up with 4000-something feature columns (also very sparse), for which my system is too slow. I have no other data about the users, other than their id's, and ofc how bad or good they rate items generally. Regardless of my slow system performance, is this how it should theoretically work? Sorry for the trivial questions, I am very new to this :(
– futuredataengineer, Commented Nov 29, 2021 at 18:02
