
I'm currently working on a project where I'm using an LSTM to learn and predict sequences of categorical data.

My dataset consists of variable-length sequences of items $s_i = [x_{i_0}, x_{i_1}, \ldots, x_{i_{-1}}]$, where $x_{i_{-1}}$ denotes the last item of sequence $i$. Each item $x$ belongs to one of 190,000 categories. Therefore, my initial approach has been to encode each $x$ with a one-hot vector representing its category.

However, using one-hot encoding makes the computational cost very high, and I want to find a more compact way of representing the categories before training the model.

In information theory, the base-2 logarithm of the number of distinct values tells you how many bits are needed to distinguish between them. In this case, $\log_2(190{,}000) \approx 17.5$, so $\lceil \log_2(190{,}000) \rceil = 18$ bits are enough to represent each category.

So, my question is, what are good approaches to represent the categories using lower-dimensional vectors?

  • For example, I could use 18-dimensional binary vectors, where the index of each category is encoded in binary. If I understand correctly, this imposes an artificial arithmetic structure on the categories (unrelated categories with nearby indices share most of their bits), but it might be worth experimenting with.

  • Alternatively, I am considering passing the one-hot vectors through a linear layer with 190k inputs and 18 outputs, so that each category is mapped to a dense 18-dimensional vector of small floats rather than a mostly-zero vector. However, I don't know how well this approach preserves the categorical information. (A sketch of both options follows below.)
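
To make the two options concrete, here is a minimal sketch, assuming PyTorch (the names `binary_encode`, `NUM_BITS`, and the toy batch are purely illustrative):

```python
import torch

NUM_CATEGORIES = 190_000
NUM_BITS = 18  # ceil(log2(190_000)) = 18

def binary_encode(indices: torch.Tensor) -> torch.Tensor:
    """Encode integer category indices as 18-dimensional {0, 1} float vectors."""
    bits = torch.arange(NUM_BITS, device=indices.device)
    return ((indices.unsqueeze(-1) >> bits) & 1).float()

# Option 1: fixed binary code, no parameters to learn.
indices = torch.tensor([0, 42, 189_999])          # toy batch of category indices
binary_vectors = binary_encode(indices)           # shape (3, 18)

# Option 2: learned projection of one-hot vectors down to 18 dimensions.
# Note that materialising the 190k-dimensional one-hot vectors is exactly
# what makes this approach memory-hungry.
projection = torch.nn.Linear(NUM_CATEGORIES, NUM_BITS, bias=False)
one_hot = torch.nn.functional.one_hot(indices, NUM_CATEGORIES).float()
projected = projection(one_hot)                   # shape (3, 18)
```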

Do these approaches make sense? Are there better ways to achieve what I'm looking for?

I appreciate any advice or thoughts on this approach. Thanks!


1 Answer


Use embeddings.

Your use case seems similar to language modelling, where text is represented as discrete tokens. While 190k categories are a lot, you can adjust the embedding dimensionality to fit your computational budget.
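
As a rough illustration, assuming PyTorch (the class name `SequenceModel` and all dimensions below are made up for the example), a trainable embedding layer in front of the LSTM replaces the one-hot encoding entirely; the embedding dimensionality is a free hyperparameter:

```python
import torch
import torch.nn as nn

NUM_CATEGORIES = 190_000
EMBEDDING_DIM = 64     # adjust to your computational budget
HIDDEN_DIM = 128

class SequenceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(NUM_CATEGORIES, EMBEDDING_DIM)
        self.lstm = nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, NUM_CATEGORIES)   # logits for the next category

    def forward(self, token_ids):                 # token_ids: (batch, seq_len) integer indices
        embedded = self.embedding(token_ids)      # (batch, seq_len, EMBEDDING_DIM)
        hidden, _ = self.lstm(embedded)           # (batch, seq_len, HIDDEN_DIM)
        return self.out(hidden)                   # (batch, seq_len, NUM_CATEGORIES)

model = SequenceModel()
batch = torch.randint(0, NUM_CATEGORIES, (4, 12))  # toy batch: 4 sequences of length 12
logits = model(batch)                              # shape (4, 12, 190000)
```

The model never sees a one-hot vector: the embedding layer indexes its weight matrix directly with the integer category ids.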

  • Thanks for your answer. As part of the project, I will be using Word2Vec embeddings, which I expect to produce the best results. However, I would like to compare their performance against a naive encoding like one-hot. The problem is that one-hot encoding with this dimensionality becomes too expensive with my data, even with very small batches. Commented Feb 28, 2023 at 19:19
  • I meant trainable embeddings, not pre-trained ones. Multiplying a one-hot vector by a matrix is equivalent to an embedding layer, but without the memory spent on the one-hot vectors (see the sketch below). Commented Feb 28, 2023 at 19:33
  • Sorry, I'm not sure I follow. How does your suggestion (multiplying a one-hot vector by a matrix) differ from the linear layer approach in my question? Can you elaborate? Commented Feb 28, 2023 at 19:45
  • You save the memory of the one-hot tensor, which has dimensions 190,000 × sequence length × batch size. Commented Feb 28, 2023 at 20:09
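
For reference, a small sketch (PyTorch assumed, with toy sizes) of the equivalence discussed in these comments: multiplying a one-hot vector by a weight matrix just selects a row of that matrix, which is exactly what an embedding lookup does, only without materialising the one-hot tensor:

```python
import torch
import torch.nn as nn

num_categories, dim = 1_000, 18
embedding = nn.Embedding(num_categories, dim)
linear = nn.Linear(num_categories, dim, bias=False)
linear.weight.data = embedding.weight.data.t()     # share the same parameters

indices = torch.tensor([3, 17, 999])
one_hot = torch.nn.functional.one_hot(indices, num_categories).float()

# Both paths produce the same 18-dimensional vectors; the embedding lookup
# never builds the (batch x 1000) one-hot tensor.
assert torch.allclose(linear(one_hot), embedding(indices), atol=1e-6)
```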
