Clustering Tweet Data using DBSCAN Algorithm

Question

I am clustering tweets with the DBSCAN algorithm. I apply all the usual preprocessing steps and convert the sentences to vectors before running the algorithm. However, it always puts a large number of tweets into cluster '0'. The following plot shows eps against the number of clusters.

These are the parameters I pass.

from sklearn.cluster import DBSCAN

# x holds the vectorised tweets
dbscan = DBSCAN(eps=0.15, min_samples=2, metric='cosine').fit(x)

The following are the resulting clusters.

label   count
-1      1221
0       1349
1       2
2       2
3       4
...
67      3
68      3
69      2
70      2
71      2

Why does cluster 0 get so many more tweets than any other cluster?

Can you please share some more detail on how you process the words before clustering? From an initial look, all your clusters might share some word that produces the cluster. Using word2vec embeddings with Euclidean distance might help.
– mahesh ghanta, Commented Nov 7, 2020 at 7:39
You need to tell us how you converted the tweets into vectors. That is the key part.
– Kasra Manshaei, Commented Nov 7, 2020 at 12:55
@mahesh ghanta: Thanks. I have used Bag of Words, TF-IDF, spaCy vectors and also Word2Vec. All of them produce a cluster '0' with a large number of tweets.
– Nilani Algiriyage, Commented Nov 8, 2020 at 9:20
@Kasra Manshaei: Thank you. Please see the previous comment.
– Nilani Algiriyage, Commented Nov 8, 2020 at 9:20
Did you check which words are important in this cluster? Do they make sense? Also, could you increase min_samples to at least 5?
– mahesh ghanta, Commented Nov 8, 2020 at 11:52

Noah Weber · Accepted Answer · 2020-11-08 13:49:48Z

Two things matter here: eps and the numerical representation of the text.

Your plot shows that only eps=0.15 produces a lot of clusters; the other values produce far fewer. eps is a hyperparameter that needs to be optimised (as does min_samples).
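One way to approach that tuning is to sweep both parameters and look at more than the cluster count, for example how many points end up as noise and how large the biggest cluster is. The sketch below is only an illustration under assumptions: `summarize_dbscan` and the toy matrix `x_demo` are hypothetical names that do not come from the question, and the grid values are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def summarize_dbscan(x, eps_values, min_samples_values, metric="cosine"):
    """Run DBSCAN over a small parameter grid and summarise each run.

    Hypothetical helper for illustration; `x` is any vectorised input.
    """
    rows = []
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(
                eps=eps, min_samples=min_samples, metric=metric
            ).fit_predict(x)
            clustered = labels[labels != -1]  # drop noise points (label -1)
            # Size of each cluster, indexed by cluster label
            sizes = np.bincount(clustered) if clustered.size else np.array([], dtype=int)
            rows.append(
                {
                    "eps": eps,
                    "min_samples": min_samples,
                    "n_clusters": int(sizes.size),
                    "n_noise": int(np.sum(labels == -1)),
                    "largest_cluster": int(sizes.max()) if sizes.size else 0,
                }
            )
    return rows


# Toy data (an assumption, not the tweet vectors): two tight groups of
# directions that are almost orthogonal to each other.
x_demo = np.array(
    [[1.0, 0.0], [1.0, 0.01], [0.99, 0.0],
     [0.0, 1.0], [0.01, 1.0], [0.0, 0.99]]
)

for row in summarize_dbscan(x_demo, eps_values=[0.05, 0.99], min_samples_values=[2]):
    print(row)
```

On real tweet vectors, a run where `largest_cluster` absorbs most of the non-noise points suggests that eps is too large for that representation. With `min_samples=2` in particular, any two points within eps of each other are enough to start or extend a cluster, so neighbouring groups can chain together into a single large one.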

The other, more important thing is the numerical representation of the text. You said you used Bag of Words, TF-IDF, spaCy vectors and also Word2Vec, but did you tune them? Did you try other embeddings, and so on? There is a lot of room for improvement here, and once the representation is good, DBSCAN will work a lot better.
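As one possible illustration of tuning the representation, the sketch below builds TF-IDF vectors with a few adjustable settings, reduces them with truncated SVD and L2-normalises the result before any clustering. This is an assumption about what tuning could look like, not the method used in the question; `embed_tweets`, the parameter values and the toy tweets are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def embed_tweets(texts, n_components=3, min_df=1, max_df=0.9):
    """Turn raw texts into dense, unit-length vectors (hypothetical helper).

    min_df / max_df drop very rare and very common terms, which is one of
    the knobs worth tuning; n_components controls the reduced dimension.
    """
    pipeline = make_pipeline(
        TfidfVectorizer(
            stop_words="english",
            min_df=min_df,
            max_df=max_df,
            sublinear_tf=True,  # dampen the effect of repeated terms
        ),
        TruncatedSVD(n_components=n_components, random_state=0),
        Normalizer(copy=False),  # unit-length rows
    )
    return np.asarray(pipeline.fit_transform(texts))


# Toy tweets, made up for this sketch
tweets = [
    "flood warning issued for the river valley",
    "river flood warning in the valley tonight",
    "new phone launch event today",
    "phone launch event streams today",
    "earthquake felt downtown this morning",
    "downtown earthquake this morning was strong",
]

x_vec = embed_tweets(tweets)
print(x_vec.shape)
```

The resulting matrix can be passed to DBSCAN in place of `x`. Changing the representation changes the distances between tweets, so eps and min_samples would need to be tuned again for each variant.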
