Questions tagged [bag-of-words]
A way of representing language data that keeps only the constituent words with their individual frequencies; i.e., grammar, word order, and the like are dropped to simplify the data.
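As a minimal sketch of the representation (pure Python, with a deliberately naive tokenizer), a bag of words is just the multiset of tokens in a text:

```python
from collections import Counter

def bag_of_words(text):
    # Deliberately simple tokenizer: lowercase + whitespace split.
    # Real pipelines also strip punctuation, handle stop words, etc.
    return Counter(text.lower().split())

bow = bag_of_words("The cat sat on the mat")
# Grammar and word order are discarded; only per-word counts remain,
# e.g. bow["the"] == 2.
```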
38 questions
0 votes
0 answers
75 views
Conditional independence assumption for Naive Bayes with Multinomial distribution
I was going through the Naive Bayes classifier (from the Cornell Machine Learning course, link here) and found quite confusing the use of the Naive Bayes classifier for bag-of-words with the Multinomial ...
2 votes
2 answers
133 views
End Tokens Are Required to Make N-gram Models Proper
The standard bigram model (for example, defined here) defines a probability distribution over a corpus $V$ based on the following principles: the marginal probability of a word $w$ is defined as its ...
0 votes
1 answer
58 views
Continuous Bag of Words derivation
The continuous bag of words model has the following log probability for observing a sequence of words: $$\log P(\textbf{w})=\sum_{c=1}^{C}\log P(w_c \mid w_{c-m},\dots,w_{c-1}, w_{c+1},\dots,w_{c+m})$$ I don't ...
1 vote
0 answers
55 views
What is a word embedding approach that would work for these pre-labeled documents?
My situation: I should start off with my end goal: I want to get a distance metric between each document and all of the other documents. To get there, I first need to encode these topic labels so that ...
0 votes
0 answers
92 views
Continuous Bag of Words NY Time Corpus
I am working to implement the continuous bag of words approach on the New York Times corpus dataset. However, I am getting word embeddings that do not seem very useful based on a few examples of ...
0 votes
0 answers
81 views
Why is using a small vocabulary for topic modelling bad?
I am trying to classify texts into topics. For example, let's say one of the topics is cooperation. So, in the vocab param of the sklearn API, some of the prevalent words (or "tokens") are ...
0 votes
0 answers
111 views
Fast feature selection on a huge dataset in R on a term document matrix
I have a 500K rows x 10K features dataset. It consists of: a term-document matrix with words + bigrams and TF-IDF weighting; 6 one-hot-encoded multi-labels. That is many more features than I want to ...
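For reference, the TF-IDF weighting mentioned above can be sketched in pure Python (a simplified log-IDF variant on a made-up toy corpus; a real 500K x 10K term-document matrix would use sparse formats):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf_idf(docs):
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # One dict of term -> tf * log(n / df) weights per document.
    return [
        {t: tf[t] * math.log(n / df[t]) for t in tf}
        for tf in (Counter(doc) for doc in docs)
    ]

weights = tf_idf(docs)
# "cat" appears in only one of three documents, so it gets a high
# weight there; "sat" appears in two, so its weight is lower.
```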
1 vote
1 answer
263 views
BOW features classifying better than complex models like BERT
I am doing a document classification task, and I find that using simple BOW features with a random forest provides better results than complex models like BERT or ELECTRA, even after doing some ...
0 votes
1 answer
66 views
Transforming topics into text data
I was reading some articles on topic classification, in which some algorithm uses snippets of text as input and tries to classify them in topics, and I thought of implementing this technique in my ...
2 votes
3 answers
497 views
How can "word hashing" cause a collision in DSSM?
They say in their paper that "word hashing" can cause a collision, but I don't understand how. For example, if the word good is transformed to ...
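For context, DSSM-style word hashing maps a word to its boundary-padded letter trigrams; a collision happens when two distinct words produce the same trigram representation. A minimal sketch (pure Python, using trigram *sets* for simplicity and a contrived collision pair):

```python
def letter_trigrams(word):
    # Pad with boundary markers, then take all character trigrams.
    padded = "#" + word + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

# "good" -> {"#go", "goo", "ood", "od#"}
# Contrived collision: "aaa" and "aaaa" yield the same trigram set.
```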
1 vote
0 answers
230 views
Language Identification Better Results with Unigrams
I have a school project which consists of identifying each language of a tweet from a dataset of tweets. The dataset contains tweets in Spanish, Portuguese, English, Basque, Galician and Catalan. The ...
1 vote
2 answers
565 views
Why CBOW model is called "continuous"?
The question is pretty clear from the title itself: why is the Continuous Bag of Words (CBOW) model called continuous? I also don't know what exactly the "distributed representation" of word vectors ...
2 votes
1 answer
1k views
Could someone please give a concrete example to illustrate the Dirichlet distribution prior for bag-of-words?
I am aware of the notion of the Dirichlet distribution, a multivariate generalization of the beta distribution. To get the parameters of the Dirichlet distribution prior for bag-of-words, this CMU ...
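As one concrete illustration (a stdlib-only sketch; the toy vocabulary and alpha values are made up), a Dirichlet draw can be generated by normalizing independent Gamma variates, yielding a probability vector over the vocabulary that a bag-of-words/multinomial model can then use:

```python
import random

def sample_dirichlet(alphas, rng):
    # A Dirichlet(alpha_1, ..., alpha_k) sample is k independent
    # Gamma(alpha_i, 1) draws, normalized to sum to 1.
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

# Symmetric prior over a toy 4-word vocabulary.
vocab = ["cat", "dog", "mat", "sat"]
theta = sample_dirichlet([1.0] * len(vocab), random.Random(0))
# theta is a valid probability distribution over vocab.
```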
3 votes
2 answers
300 views
Could someone please give a concrete example to illustrate what Multiplicity means in the context of the Bag-of-words model?
This CMU Machine Learning Course uses the Bag-of-words model without much explanation. The wiki uses the term multiplicity to explain the model. The bag-of-words model is a simplifying ...
1 vote
2 answers
778 views
Text classification with small dataset for a specialized domain
I have a multiclass text classification problem where I have very few documents per class. The classes are imbalanced, but I want to be able to predict the class when I have at least 200 - 300 ...