Cluster documents and identify the prominent document in the cluster?

Question

I have a set of documents as given in the example below.

    doc1 = {'Science': 0.7, 'History': 0.05, 'Politics': 0.15, 'Sports': 0.1}
    doc2 = {'Science': 0.3, 'History': 0.5, 'Politics': 0.1, 'Sports': 0.1}

I want to cluster the documents and identify the most prominent document within the cluster.

For example, if cluster 1 contains {doc1, doc4, doc5, doc8}, I want to find the most prominent document that represents this cluster (e.g., doc8), or otherwise identify the main theme of the cluster.

Please let me know a suitable approach to achieve this :)

How do you want to define prominence? How about proximity to the cluster centroid? It sounds like you want a graph-theoretic definition, but you don't seem to have social data.
– Emre, Commented Jul 6, 2017 at 4:17
I have a cosine similarity matrix and used DBSCAN on it to cluster the documents. Now I want to know which document is the most representative of a given cluster, in order to identify the main theme of the cluster :)
– Smith, Commented Jul 6, 2017 at 6:12
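
When only a precomputed cosine similarity matrix and the DBSCAN labels are available, as described in the comment above, one possible way to pick a representative is a medoid-style rule: choose the document with the highest mean similarity to the other members of its cluster. This is a minimal sketch of that alternative, not the method from the accepted answer; the function name `cluster_representatives` and the toy matrix are hypothetical and exist only for illustration.

```python
import numpy as np


def cluster_representatives(similarity, labels):
    """Return {cluster label: index of the most representative document}.

    similarity: (n, n) symmetric matrix of pairwise cosine similarities.
    labels:     length-n array of cluster labels (e.g. DBSCAN's labels_).
    """
    similarity = np.asarray(similarity, dtype=float)
    labels = np.asarray(labels)
    representatives = {}
    for label in np.unique(labels):
        if label == -1:  # DBSCAN marks noise points with -1; skip them
            continue
        members = np.flatnonzero(labels == label)
        # Similarities restricted to documents inside this cluster
        block = similarity[np.ix_(members, members)]
        # Mean similarity of each member to the whole cluster. The diagonal
        # (self-similarity, 1.0 for cosine) adds the same constant to every
        # member, so it does not change the ranking.
        mean_sim = block.mean(axis=1)
        representatives[int(label)] = int(members[np.argmax(mean_sim)])
    return representatives


# Hypothetical toy data: documents 0-2 form one cluster, document 3 another
similarity = np.array([
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.95, 0.2],
    [0.8, 0.95, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
])
labels = np.array([0, 0, 0, 1])
print(cluster_representatives(similarity, labels))
```

Unlike a centroid, a medoid is always an actual document, and this rule needs only the similarity matrix, not the original feature vectors.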

Bogas · Accepted Answer · 2017-07-06 13:53:45Z

A very simple approach is to compute a centroid for each cluster (e.g. by averaging the topic distributions of the documents that belong to it) and then calculate the cosine distance of each document in the cluster from that centroid. The document with the shortest distance is the closest to the centroid and hence the most "representative".
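
Written out, this amounts to the following. The notation is an assumption introduced here for illustration: $x_i$ is the topic vector of document $i$, $C_k$ is the set of documents assigned to cluster $k$, $c_k$ is its centroid, and $r_k$ is the index of its representative.

```latex
c_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i ,
\qquad
r_k = \arg\min_{i \in C_k}
      \left( 1 - \frac{x_i \cdot c_k}{\lVert x_i \rVert \, \lVert c_k \rVert} \right)
```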

Continuing from the previous example:

    import pandas as pd
    import numpy as np
    from sklearn.metrics import pairwise_distances
    from scipy.spatial.distance import cosine
    from sklearn.cluster import DBSCAN

    # Initialize some documents
    doc1 = {'Science':0.8, 'History':0.05, 'Politics':0.15, 'Sports':0.1}
    doc2 = {'News':0.2, 'Art':0.8, 'Politics':0.1, 'Sports':0.1}
    doc3 = {'Science':0.8, 'History':0.1, 'Politics':0.05, 'News':0.1}
    doc4 = {'Science':0.1, 'Weather':0.2, 'Art':0.7, 'Sports':0.1}
    collection = [doc1, doc2, doc3, doc4]
    df = pd.DataFrame(collection)

    # Fill missing values with zeros
    df.fillna(0, inplace=True)

    # Get feature vectors (DataFrame.as_matrix() was removed in pandas 1.0)
    feature_matrix = df.to_numpy()

    # Fit DBSCAN on the precomputed cosine distance matrix
    db = DBSCAN(min_samples=1, metric='precomputed').fit(
        pairwise_distances(feature_matrix, metric='cosine'))
    labels = db.labels_
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    print('Estimated number of clusters: %d' % n_clusters_)

    # Find the representatives
    representatives = {}
    for label in set(labels):
        # Find indices of documents belonging to the same cluster
        ind = np.argwhere(labels == label).reshape(-1,)
        # Select these specific documents
        cluster_samples = feature_matrix[ind, :]
        # Calculate their centroid as an average
        centroid = np.average(cluster_samples, axis=0)
        # Find the distance of each document from the centroid
        distances = [cosine(sample_doc, centroid) for sample_doc in cluster_samples]
        # Keep the document closest to the centroid as the representative
        representatives[label] = cluster_samples[np.argsort(distances), :][0]

    # dict.iteritems() exists only in Python 2; items() works in Python 3
    for label, doc in representatives.items():
        print("Label : %d -- Representative : %s" % (label, str(doc)))

However, while running the code I get an error saying "AttributeError: 'dict' object has no attribute 'iteritems'". Do you know how to fix it? :)
– Smith, Commented Jul 7, 2017 at 0:26
Should I pass cosine distance or 1 - cosine distance (in other words, cosine similarity) to the fit call of DBSCAN? DBSCAN(min_samples=1, metric='precomputed').fit(pairwise_distances(feature_matrix, metric='cosine'))
– Smith, Commented Jul 7, 2017 at 3:45
@Smith, according to the sklearn.DBSCAN fit documentation, you should use a distance matrix as input, not a similarity matrix. You will need to experiment with the min_samples and eps parameters to suit your data.
– Bogas, Commented Jul 7, 2017 at 7:56
Please let me know if you know an answer for this: datascience.stackexchange.com/questions/20255/…
– Smith, Commented Jul 9, 2017 at 3:49
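
On the distance-versus-similarity question in the comments above: if what you already have is a cosine similarity matrix, it can be turned into a distance matrix before it is passed to DBSCAN with metric='precomputed'. The following is a rough sketch under assumed values; the toy feature matrix and the eps and min_samples settings are illustrative only and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

# Toy topic distributions (rows = documents); values are illustrative only
features = np.array([
    [0.8, 0.05, 0.15, 0.0],
    [0.7, 0.10, 0.10, 0.1],
    [0.0, 0.10, 0.10, 0.8],
    [0.1, 0.00, 0.10, 0.8],
])

similarity = cosine_similarity(features)

# Cosine distance = 1 - cosine similarity. Clipping at zero guards against
# tiny negative values from floating-point rounding, since negative
# distances are not meaningful and may be rejected.
distance = np.clip(1.0 - similarity, 0.0, None)

# eps and min_samples are assumed values for this toy data, not recommendations
db = DBSCAN(eps=0.3, min_samples=1, metric='precomputed').fit(distance)
print(db.labels_)
```

Passing the similarity matrix itself would invert the notion of closeness: the most similar documents would look like the most distant ones.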
