
Questions tagged [bag-of-words]

A way of representing language data that consists of the constituent words with their individual frequencies; i.e., grammar, word order, etc., are dropped to simplify the data.

0 votes
0 answers
75 views

I was going through the Naive Bayes classifier (from the Cornell Machine Learning course, link here) and I was confused by the use of the Naive Bayes classifier for bag-of-words with the Multinomial ...
Reda A.
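For context, a minimal sketch of the combination the question asks about: multinomial Naive Bayes over bag-of-words counts, using scikit-learn (the toy documents and labels are invented for illustration, not taken from the course):

```python
# Multinomial Naive Bayes on bag-of-words counts: a minimal sketch.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["the movie was great", "the movie was terrible", "great acting"]
labels = [1, 0, 1]  # hypothetical sentiment labels

vectorizer = CountVectorizer()        # bag-of-words: word counts, order discarded
X = vectorizer.fit_transform(docs)    # sparse document-term count matrix
clf = MultinomialNB().fit(X, labels)  # multinomial likelihood over the counts

print(clf.predict(vectorizer.transform(["great movie"])))
```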
2 votes
2 answers
133 views

The standard bigram model (for example, as defined here) defines a probability distribution over a corpus $V$ based on the following principles: the marginal probability of a word $w$ is defined as its ...
olives • 93
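A minimal sketch of those two quantities, assuming the usual maximum-likelihood definitions (the excerpt is truncated, so the exact definitions in the question are an assumption here): the marginal probability of a word is its relative frequency, and the conditional bigram probability is a ratio of counts.

```python
# Maximum-likelihood unigram and bigram probabilities on a toy corpus.
from collections import Counter

corpus = "the cat sat on the mat".split()  # toy corpus for illustration

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_word(w):
    """Marginal MLE probability: count(w) / N."""
    return unigrams[w] / len(corpus)

def p_next(w2, w1):
    """Conditional MLE probability: count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_word("the"))        # 2/6
print(p_next("cat", "the")) # 1/2
```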
0 votes
1 answer
58 views

The continuous bag of words model has the following log probability for observing a sequence of words: $$\log P(\mathbf{w})=\sum_{c=1}^{C}\log P(w_c \mid w_{c-m},\dots,w_{c-1},w_{c+1},\dots,w_{c+m})$$ I don't ...
Victor M • 339
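For reference, the conditional probability inside that sum is usually parameterized as a softmax over an averaged context vector. This is the standard word2vec CBOW form and is an assumption here, since the excerpt is cut off:

$$P(w_c \mid w_{c-m},\dots,w_{c-1},w_{c+1},\dots,w_{c+m}) = \frac{\exp\left(u_{w_c}^{\top}\bar{v}_c\right)}{\sum_{w \in V}\exp\left(u_{w}^{\top}\bar{v}_c\right)}, \qquad \bar{v}_c = \frac{1}{2m}\sum_{\substack{-m \le j \le m \\ j \neq 0}} v_{w_{c+j}}$$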
1 vote
0 answers
55 views

My situation: I should start off with my end goal: I want to get a distance metric between each document and all of the other documents. To get there, I first need to encode these topic labels so that ...
Jacob Myer
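A minimal sketch of that end goal: encode each document's topic labels as a multi-hot vector and compute a document-by-document distance matrix. The labels and the choice of Jaccard distance below are hypothetical, since the question is truncated:

```python
# Multi-hot encode topic labels, then compute all pairwise distances.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import pairwise_distances

doc_topics = [["sports", "politics"], ["politics"], ["tech"]]  # toy labels
X = MultiLabelBinarizer().fit_transform(doc_topics).astype(bool)

D = pairwise_distances(X, metric="jaccard")  # distance between every pair of docs
print(D)
```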
0 votes
0 answers
92 views

I am working to implement the continuous bag of words approach on the New York Times corpus dataset. However, I am getting word embeddings that do not seem very useful based on a few examples of ...
dzheng1887
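One common way to sanity-check such embeddings is to inspect nearest neighbours of frequent words. A self-contained sketch with gensim (the toy sentences below stand in for the NYT corpus; `sg=0` selects CBOW in gensim 4.x):

```python
# Train a tiny CBOW model and inspect nearest neighbours as a sanity check.
from gensim.models import Word2Vec

sentences = [["stocks", "rose", "today"], ["stocks", "fell", "today"],
             ["markets", "rose", "today"], ["markets", "fell", "sharply"]]
model = Word2Vec(sentences, vector_size=20, window=2, sg=0, min_count=1, seed=0)

print(model.wv.most_similar("stocks", topn=3))  # neighbours should look topical
```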
0 votes
0 answers
81 views

I am trying to classify texts into topics. For example, let's say one of the topics is cooperation. So, in the vocab parameter of the sklearn API, some of the prevalent words (or "tokens") are ...
yishairasowsky
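A minimal sketch of pinning the vocabulary in scikit-learn's `CountVectorizer`, which is presumably the "vocab param of the sklearn API" mentioned above (the topic word list is invented for illustration):

```python
# Restrict bag-of-words counting to a fixed, topic-specific vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

cooperation_words = ["together", "agreement", "partner", "share"]  # hypothetical
vectorizer = CountVectorizer(vocabulary=cooperation_words)

X = vectorizer.fit_transform(["we reached an agreement to work together"])
print(X.toarray())  # counts only over the fixed vocabulary, in its given order
```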
0 votes
0 answers
111 views

I have a dataset of 500K rows × 10K features. It consists of: a term-document matrix with words + bigrams and TF-IDF weighting, and 6 one-hot encoded multi-labels. That is many more features than I want to ...
Xiiryo • 111
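One standard option for shrinking a wide, sparse TF-IDF matrix is truncated SVD (latent semantic analysis), which works directly on sparse input. A sketch with placeholder shapes and `n_components` (a random sparse matrix stands in for the real data):

```python
# Reduce a sparse TF-IDF matrix to a low-dimensional dense representation.
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

X = sp.random(500, 1000, density=0.01, format="csr")  # stand-in for 500K x 10K

svd = TruncatedSVD(n_components=100)
X_reduced = svd.fit_transform(X)  # dense, one row per document
print(X_reduced.shape)
```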
1 vote
1 answer
263 views

I am doing a document classification task and I find that using simple BOW features with a random forest provides better results than using complex models like BERT or ELECTRA, even after doing some ...
Atirag • 159
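For reference, the simple baseline described above can be expressed as a short scikit-learn pipeline (the toy documents and labels are made up):

```python
# Bag-of-words features feeding a random forest, as a single pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

docs = ["invoice attached", "meeting at noon", "payment overdue"]
labels = ["finance", "scheduling", "finance"]

clf = make_pipeline(CountVectorizer(), RandomForestClassifier(n_estimators=200))
clf.fit(docs, labels)

print(clf.predict(["please see the attached invoice"]))
```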
0 votes
1 answer
66 views

I was reading some articles on topic classification, in which some algorithm takes snippets of text as input and tries to classify them into topics, and I thought of implementing this technique in my ...
enzo • 113
2 votes
3 answers
497 views

They say in their paper that "word hashing" can cause a collision, but I don't understand how. For example, if the word good is transformed to ...
Dims • 412
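For context, a minimal sketch of letter-trigram "word hashing" as in the DSSM paper the question refers to: a collision means two distinct words produce the same set of trigrams, so they become indistinguishable after hashing.

```python
# Letter-trigram word hashing: each word becomes a set of 3-letter pieces.
def letter_trigrams(word):
    padded = f"#{word}#"  # boundary markers around the word, as in the paper
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

print(letter_trigrams("good"))  # {'#go', 'goo', 'ood', 'od#'}

# A collision is any pair w1 != w2 with
#   letter_trigrams(w1) == letter_trigrams(w2);
# such pairs exist but are rare, which is the risk the paper notes.
```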
1 vote
0 answers
230 views

I have a school project which consists of identifying the language of each tweet in a dataset of tweets. The dataset contains tweets in Spanish, Portuguese, English, Basque, Galician and Catalan. The ...
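A common baseline for this kind of task (not necessarily the project's required method) is character n-gram TF-IDF with a linear classifier, which tends to handle short, noisy text like tweets well. A sketch with invented toy tweets:

```python
# Character n-gram TF-IDF + logistic regression for language identification.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tweets = ["hola amigo", "bom dia", "good morning"]  # toy examples
langs = ["es", "pt", "en"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(tweets, langs)

print(clf.predict(["buenos dias"]))
```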
1 vote
2 answers
565 views

The question is pretty clear from the title itself: why is the Continuous Bag of Words (CBOW) model called continuous? I also don't know what exactly the "distributed representation" of word vectors ...
Ruchit Patel
2 votes
1 answer
1k views

I am aware of the notion of the Dirichlet distribution, a multivariate generalization of the beta distribution. To get the parameters of the Dirichlet prior for bag-of-words, this CMU ...
JJJohn • 2,015
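For context, a worked sketch of the standard Dirichlet–multinomial conjugacy (an assumption here; the CMU notes' exact notation may differ): with a $\mathrm{Dir}(\alpha_1,\dots,\alpha_K)$ prior over word probabilities $\theta$ and observed word counts $n_1,\dots,n_K$, the posterior is again Dirichlet, giving smoothed estimates:

$$\theta \mid \mathbf{n} \sim \mathrm{Dir}(\alpha_1+n_1,\dots,\alpha_K+n_K), \qquad \hat{\theta}_k = \frac{n_k+\alpha_k}{\sum_{j=1}^{K}(n_j+\alpha_j)}$$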
3 votes
2 answers
300 views

This CMU Machine Learning course uses the bag-of-words model without much explanation. Wikipedia uses the term multiplicity to explain the model. The bag-of-words model is a simplifying ...
JJJohn • 2,015
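A minimal sketch of what "multiplicity" means here: a bag (multiset) keeps how many times each word occurs, unlike a plain set.

```python
# A bag-of-words is a multiset: word -> multiplicity.
from collections import Counter

tokens = "the cat and the hat".split()
bag = Counter(tokens)

print(bag)          # Counter({'the': 2, 'cat': 1, 'and': 1, 'hat': 1})
print(set(tokens))  # a plain set drops the multiplicities
```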
1 vote
2 answers
778 views

I have a multiclass text classification problem where I have very few documents for each class. The classes are imbalanced but I want to be able to predict the class when I have at least 200 - 300 ...
nicnaz • 77
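One common mitigation for imbalanced classes with few documents each is a class-weighted linear model over bag-of-words features. A sketch with invented toy data (the data loading is elided):

```python
# Class-weighted linear SVM over TF-IDF bag-of-words features.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["refund request", "password reset", "refund please"]  # toy data
labels = ["billing", "account", "billing"]

clf = make_pipeline(
    TfidfVectorizer(),
    LinearSVC(class_weight="balanced"),  # upweight rare classes
)
clf.fit(docs, labels)

print(clf.predict(["reset my password"]))
```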
