
From a standpoint of interpretation, can I use NMF on one-hot encoded categorical data for dimension reduction? I have mixed data and was thinking about one-hot encoding the categorical features and min-max normalizing the numerical features.

I have read that this approach is not recommended with PCA (see the Stack Overflow discussion here), but NMF follows a different principle than PCA.

Is this a valid approach?
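
For concreteness, here is a minimal sketch of the preprocessing I have in mind, using scikit-learn (the column names are made up for illustration):

```python
# One-hot encode the categorical columns and min-max scale the numerical ones,
# producing a single non-negative matrix suitable as NMF input.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],  # categorical feature
    "age":   [23, 45, 31, 52],                 # numerical feature
})

pre = ColumnTransformer([
    # use sparse=False instead of sparse_output on scikit-learn < 1.2
    ("cat", OneHotEncoder(sparse_output=False), ["color"]),
    ("num", MinMaxScaler(), ["age"]),
])

X = pre.fit_transform(df)  # one-hot columns followed by the scaled numericals
```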

Thank you very much in advance!

Jimmy


2 Answers


There is in fact a probabilistic interpretation of NMF, specifically when the objective function is the generalized KL divergence. The matrix entries are treated as realizations of Poisson random variables, where the rate parameter of each entry $X_{ij}$ is given by the dot product of the $i$th row of $W$ and the $j$th column of $H$ in the decomposition $X \approx WH$. This takes advantage of the fact that the sum of two independent Poisson random variables is also Poisson, with rate equal to the sum of the two rates, and it is what makes the model well suited to count data.

The other commonly used objective function, the Frobenius norm, is generally very poorly suited to count data, because in most applications counts are highly heteroskedastic. The Frobenius norm doesn't conform to intuition there: the contribution of a particular entry's error to the objective is on the order of the error squared, whereas with the generalized KL divergence the contribution is roughly proportional to $e \log e$ (where $e$ is the error). The latter penalizes reconstruction error in a way that also takes the absolute size of the input into account.
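
If you want to experiment with this, scikit-learn's NMF supports the generalized KL divergence via beta_loss="kullback-leibler", which requires the multiplicative-update solver. A minimal sketch on synthetic count data (all sizes here are arbitrary):

```python
# Fit NMF under the generalized KL divergence; under the Poisson view,
# (W @ H)[i, j] is the estimated rate parameter for entry X[i, j].
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(100, 20)).astype(float)  # synthetic counts

model = NMF(n_components=5, beta_loss="kullback-leibler", solver="mu",
            max_iter=500, random_state=0)
W = model.fit_transform(X)  # per-sample loadings
H = model.components_       # per-feature components
```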

But that also means that the contribution of small errors on small entries is comparatively much smaller than for large counts. It may turn out that NMF with the generalized KL divergence works just fine on a one-hot encoding, but that isn't something you should count on. You'll likely get a decomposition no better than if you had decomposed the numerical features alone, and quite possibly worse. NMF isn't really a good way to exploit the structure of categorical features, and keeping that structure explicit can be very helpful for downstream tasks where another approach can use it. So I'd recommend trying NMF with each objective function on the numerical features only, and using the concatenation of the learned components and the categorical features (a sketch follows below).
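
A minimal sketch of that recommendation, assuming X_num and X_cat are the already-preprocessed numerical and one-hot categorical blocks (those names are mine, not from the question):

```python
# Reduce only the numerical features with NMF, then append the untouched
# one-hot categorical columns so downstream models can still see that structure.
import numpy as np
from sklearn.decomposition import NMF

def reduce_and_concat(X_num, X_cat, n_components=5, beta_loss="frobenius"):
    # the generalized KL loss needs the multiplicative-update solver
    solver = "mu" if beta_loss != "frobenius" else "cd"
    nmf = NMF(n_components=n_components, beta_loss=beta_loss,
              solver=solver, max_iter=500, random_state=0)
    W = nmf.fit_transform(X_num)  # reduced numerical representation
    return np.hstack([W, X_cat])  # learned components + categorical features
```

Try it with both beta_loss="frobenius" and beta_loss="kullback-leibler" and keep whichever works better for your downstream task.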

  • what do you mean by "use the concatenation of the learned components and the categorical features" - do you mean two NMFs? Commented Mar 14, 2020 at 12:08

From a standpoint of interpretation, can I use NMF on one-hot encoded categorical data for dimension reduction?

Unfortunately, NMF doesn't have as nice a probabilistic interpretation as PCA does. Because of this, your question is a bit like asking whether requirements are satisfied when there are no requirements.

We can tackle the problem from another angle, though. I'd argue you should give it a try, since NMF is routinely used for a similar problem: topic modeling. Your data will be binary after one-hot encoding, so it resembles count data (although it is a bit different).

NMF works well for topic modeling because, when you factorize sparse matrices, the factors also tend to be sparse. That suggests it could be useful in your case as well; a small experiment along these lines is sketched below.
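
A quick, hedged experiment in that spirit: factorize a random sparse binary matrix (standing in for one-hot encoded data) and check how sparse the learned factors come out. Whether they do on your data is an empirical question; the sizes and thresholds here are arbitrary.

```python
# Factorize a sparse binary matrix and measure the sparsity of the factors.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(42)
X = (rng.random((200, 30)) < 0.2).astype(float)  # sparse binary "one-hot-like" data

nmf = NMF(n_components=8, beta_loss="kullback-leibler", solver="mu",
          max_iter=500, random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_
print("near-zero share of W:", np.mean(W < 1e-6))
print("near-zero share of H:", np.mean(H < 1e-6))
```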

