$\begingroup$

I would like to build a recommendation system:

  • No ratings are available at recommendation time, so only a purely context-based recommendation system is possible.

  • The input features are the answers to a questionnaire (all categorical).

My idea is the following:

  • Find the most similar users based on the answers from the questionnaire with a suitable distance measure.

  • Treat the past recommendations of these users as relevant and meaningful for the new user in the system.


When choosing the encoding and the distance measure, my problem is that all variables are categorical, ranging from binary questions to questions with 20 unique values. One-hot encoding has drawbacks such as multicollinearity, and I am unsure about it because variables with 20 unique possibilities might receive too strong an emphasis.

Does anyone have a recommendation for a possible approach? Thanks a lot!

$\endgroup$
1
  • $\begingroup$ Clustering Categorical Data using Gower distance (in Python): link $\endgroup$ Commented Oct 5, 2022 at 5:32

1 Answer

$\begingroup$

In R there is a package called dprep, which provides a function called knngow(). It is a k-nearest-neighbours (KNN) algorithm that uses the Gower distance rather than a geometric distance such as Euclidean or Manhattan.

It is particularly useful for nominal and ordinal variables that translate into binary or multi-level factors, because it can handle the differences between the levels of a variable without being biased by their ranks.
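For reference, here is the Gower dissimilarity as it is commonly defined, restricted to the all-nominal case in the question. The symbols ($p$ questions, answers $x_{ik}$, optional weights $w_k$) are notation introduced here for illustration, not taken from the dprep documentation:

$$
d(i, j) = \frac{\sum_{k=1}^{p} w_k \, \delta_{ijk}}{\sum_{k=1}^{p} w_k},
\qquad
\delta_{ijk} =
\begin{cases}
0 & \text{if } x_{ik} = x_{jk} \\
1 & \text{otherwise}
\end{cases}
$$

With equal weights this is simply the share of questions that two users answered differently. Each question contributes at most $w_k$ to the sum, whether it has 2 possible answers or 20.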

There is a dearth of good tutorials or information on it, but it is a solid step in the right direction for you because it solves the distance dilemma under the hood.
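As a rough illustration of the idea in Python (the language mentioned in the comment on the question), the sketch below assumes every feature is nominal, there are no missing answers and all questions are weighted equally. Under those assumptions the Gower distance is the share of questions answered differently. This is not the dprep implementation; the function names `gower_categorical` and `recommend` and the toy data are hypothetical.

```python
from collections import Counter


def gower_categorical(a, b):
    """Gower distance for all-nominal, equally weighted answers:
    the fraction of questions on which the two users differ."""
    if len(a) != len(b):
        raise ValueError("both users must answer the same questions")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def recommend(new_answers, users, k=2, top_n=3):
    """Pool the past recommendations of the k nearest users.

    `users` is a list of (answers, past_items) pairs (hypothetical layout);
    each user's past_items are assumed to contain no duplicates.
    """
    # Rank existing users by their distance to the new user.
    ranked = sorted(users, key=lambda u: gower_categorical(new_answers, u[0]))
    votes = Counter()
    for _, items in ranked[:k]:
        votes.update(items)  # one vote per neighbour per item
    return [item for item, _ in votes.most_common(top_n)]


# Hypothetical toy data: three questionnaire answers per user.
users = [
    (["yes", "red", "city"], ["A", "B"]),
    (["yes", "blue", "city"], ["B", "C"]),
    (["no", "green", "suburb"], ["D"]),
]
new_user = ["yes", "red", "rural"]
recs = recommend(new_user, users, k=2)
```

Because every question adds at most `1 / len(a)` to the distance, a question with 20 possible answers carries no more weight than a binary one, and no one-hot encoding is needed.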

$\endgroup$