Since you only have 5 groups, you should probably look at pairwise distances between them instead of clustering.
Now the question is: which distance should you use?
Of course you could use Euclidean, Manhattan, or cosine distances and be done with it. In this case, I'd pick cosine distance because it limits how much any single dimension can dominate the overall distance, and with 2000 features/interests that can help.
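For example, computing pairwise cosine distances is a one-liner with scipy; here's a minimal sketch, where `groups` is a made-up 5 x 2000 matrix of interest scores standing in for your real data:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
groups = rng.random((5, 2000))   # toy data: 5 groups x 2000 interest scores

# Pairwise cosine distances between the 5 groups -> 5 x 5 symmetric matrix
cosine_dist = squareform(pdist(groups, metric="cosine"))
print(cosine_dist.round(3))
```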
Now, I'm guessing that the interests themselves are somewhat correlated. In an extreme case, 1999 of the interests are very correlated and 1 isn't. If that happens and you use a regular distance, two groups that agree only on the 1999 correlated interests will be considered much closer than two groups that agree only on the 1 remaining interest, even though in both cases they effectively agree on just one underlying interest.
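Here's a tiny made-up illustration of that effect with plain Euclidean distance (all numbers invented):

```python
import numpy as np

# Interests 0..1998 are near-copies of one underlying interest; interest 1999 is independent.
a = np.r_[np.ones(1999), 0.0]
b = np.r_[np.ones(1999), 1.0]    # agrees with a on the 1999 correlated interests only
c = np.r_[np.zeros(1999), 0.0]   # agrees with a on the single independent interest only

print(np.linalg.norm(a - b))     # 1.0   -> a and b look very close
print(np.linalg.norm(a - c))     # ~44.7 -> a and c look very far apart
# Yet each pair really agrees on exactly one of the two underlying interests.
```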
So you might want to weight your distances using the correlations between the interests: the more unique interests should arguably matter more than the ones that are very similar to many others. One way to do that is a dimensionality reduction technique like PCA, which will "merge" redundant interests into a single component. Once you have reduced the dimensionality of your data (say you are now looking at 20 components instead of 2000 interests), you can compute your distances on the reduced representation.
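Here's a rough sketch of the PCA route. One caveat: if all you have are the 5 group-level vectors, PCA fit on them can give you at most 4 components, so getting to 20 would require fitting PCA on finer-grained data (e.g., individual members), which this toy example doesn't have:

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
groups = rng.random((5, 2000))          # toy data: 5 groups x 2000 interest scores

# With only 5 samples, PCA can keep at most 4 components.
pca = PCA(n_components=4)
reduced = pca.fit_transform(groups)     # 5 x 4 matrix of decorrelated "meta-interests"

pca_dist = squareform(pdist(reduced, metric="euclidean"))
print(pca_dist.round(3))
```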
Of course, distances are subjective and you have to define how much agreeing on specific interests matters. Perhaps agreeing on sports matters more than agreeing on books. If you have this prior knowledge, you'd have to build it into your weights manually.
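If you do have that kind of prior knowledge, a weighted distance is straightforward; the weights below are invented purely for illustration (pretending the first 100 columns are sports interests):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
groups = rng.random((5, 2000))      # toy data: 5 groups x 2000 interest scores

weights = np.ones(2000)
weights[:100] = 2.0                 # hypothetical: first 100 columns are sports, counted twice

weighted_dist = squareform(pdist(groups, metric="cosine", w=weights))
print(weighted_dist.round(3))
```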
Once you've decided how to compute your distances, you could compute multiple distance matrices by sampling random subsets of the features (interests). Groups that are genuinely far apart should stay far apart in most subsets, and likewise for groups that are close. You could then look at the average distance across the subsets to decide how far the groups really are from each other.
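A sketch of that resampling idea, again on made-up data; the number of repetitions and the subset size are arbitrary choices you'd want to tune:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
groups = rng.random((5, 2000))      # toy data: 5 groups x 2000 interest scores

n_repeats, subset_size = 100, 500   # arbitrary choices
dist_matrices = []
for _ in range(n_repeats):
    cols = rng.choice(groups.shape[1], size=subset_size, replace=False)
    dist_matrices.append(squareform(pdist(groups[:, cols], metric="cosine")))

avg_dist = np.mean(dist_matrices, axis=0)   # average 5 x 5 distance matrix
std_dist = np.std(dist_matrices, axis=0)    # how stable each pairwise distance is
print(avg_dist.round(3))
```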