
I'm currently working on a project where I'm using an LSTM to learn and predict sequences of categorical data.

My dataset consists of variable-length sequences of items $s_i = [x_{i_0}, x_{i_1}, \ldots, x_{i_{-1}}]$, where $x_{i_{-1}}$ denotes the last item of sequence $i$. Each item $x$ belongs to one of 190,000 categories. Therefore, my initial approach has been to encode each $x$ with a one-hot vector representing its category.

However, using one-hot encoding makes the computational cost very high, and I want to find a more compact way of representing the categories before training the model.

In information theory, the base-2 logarithm of the number of distinct values tells you how many bits are needed to distinguish between them. In this case, $\log_2(190{,}000) \approx 17.5$, so $\lceil \log_2(190{,}000) \rceil = 18$ bits are enough to represent each category.

So, my question is, what are good approaches to represent the categories using lower-dimensional vectors?

  • For example, I could use 18-dimensional binary vectors, where the index of each category is encoded in binary. If I understand correctly, this imposes an artificial arithmetic structure on the categories (unrelated categories with nearby indices share most of their bits), but it might be worth experimenting with.

  • Alternatively, I am considering passing the one-hot vectors through a linear layer with 190k inputs and 18 outputs, so that each category is mapped to a dense 18-dimensional vector of small floats rather than a mostly-zero vector. However, I don't know how well this approach preserves the categorical information. (A sketch of both options follows below.)
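
To make the two options concrete, here is a minimal sketch, assuming PyTorch (the names `binary_encode`, `NUM_BITS`, and the toy batch are purely illustrative):

```python
import torch

NUM_CATEGORIES = 190_000
NUM_BITS = 18  # ceil(log2(190_000)) = 18

def binary_encode(indices: torch.Tensor) -> torch.Tensor:
    """Encode integer category indices as 18-dimensional {0, 1} float vectors."""
    bits = torch.arange(NUM_BITS, device=indices.device)
    return ((indices.unsqueeze(-1) >> bits) & 1).float()

# Option 1: fixed binary code, no parameters to learn.
indices = torch.tensor([0, 42, 189_999])          # toy batch of category indices
binary_vectors = binary_encode(indices)           # shape (3, 18)

# Option 2: learned projection of one-hot vectors down to 18 dimensions.
# Note that materialising the 190k-dimensional one-hot vectors is exactly
# what makes this approach memory-hungry.
projection = torch.nn.Linear(NUM_CATEGORIES, NUM_BITS, bias=False)
one_hot = torch.nn.functional.one_hot(indices, NUM_CATEGORIES).float()
projected = projection(one_hot)                   # shape (3, 18)
```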

Do these approaches make sense? Are there better ways to achieve what I'm looking for?

I appreciate any advice or thoughts on this approach. Thanks!


1 Answer


Use embeddings.

Your use case seems similar to language modelling, where text is represented as discrete tokens. While 190k categories are a lot, you can adjust the embedding dimensionality to fit your computational budget.
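
As a rough illustration, assuming PyTorch (the class name `SequenceModel` and all dimensions below are made up for the example), a trainable embedding layer in front of the LSTM replaces the one-hot encoding entirely; the embedding dimensionality is a free hyperparameter:

```python
import torch
import torch.nn as nn

NUM_CATEGORIES = 190_000
EMBEDDING_DIM = 64     # adjust to your computational budget
HIDDEN_DIM = 128

class SequenceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(NUM_CATEGORIES, EMBEDDING_DIM)
        self.lstm = nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, NUM_CATEGORIES)   # logits for the next category

    def forward(self, token_ids):                 # token_ids: (batch, seq_len) integer indices
        embedded = self.embedding(token_ids)      # (batch, seq_len, EMBEDDING_DIM)
        hidden, _ = self.lstm(embedded)           # (batch, seq_len, HIDDEN_DIM)
        return self.out(hidden)                   # (batch, seq_len, NUM_CATEGORIES)

model = SequenceModel()
batch = torch.randint(0, NUM_CATEGORIES, (4, 12))  # toy batch: 4 sequences of length 12
logits = model(batch)                              # shape (4, 12, 190000)
```

The model never sees a one-hot vector: the embedding layer indexes its weight matrix directly with the integer category ids.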

  • Thanks for your answer. As part of the project, I will be using Word2Vec embeddings, which I expect to produce the best results. However, I would like to compare their performance against a naive encoding like one-hot. The problem is that one-hot encoding with this dimensionality becomes too expensive with my data, even with very small batches. Commented Feb 28, 2023 at 19:19
  • I meant trainable embeddings, not pre-trained ones. Multiplying a one-hot vector by a matrix is equivalent to an embedding layer, but without the memory spent on the one-hot vectors (see the sketch below). Commented Feb 28, 2023 at 19:33
  • Sorry, I'm not sure I follow. How does your suggestion (multiplying a one-hot vector by a matrix) differ from the linear layer approach in my question? Can you elaborate? Commented Feb 28, 2023 at 19:45
  • You save the memory of the one-hot tensor, which has dimensions 190,000 × sequence length × batch size. Commented Feb 28, 2023 at 20:09
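
For reference, a small sketch (PyTorch assumed, with toy sizes) of the equivalence discussed in these comments: multiplying a one-hot vector by a weight matrix just selects a row of that matrix, which is exactly what an embedding lookup does, only without materialising the one-hot tensor:

```python
import torch
import torch.nn as nn

num_categories, dim = 1_000, 18
embedding = nn.Embedding(num_categories, dim)
linear = nn.Linear(num_categories, dim, bias=False)
linear.weight.data = embedding.weight.data.t()     # share the same parameters

indices = torch.tensor([3, 17, 999])
one_hot = torch.nn.functional.one_hot(indices, num_categories).float()

# Both paths produce the same 18-dimensional vectors; the embedding lookup
# never builds the (batch x 1000) one-hot tensor.
assert torch.allclose(linear(one_hot), embedding(indices), atol=1e-6)
```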
