I have been trying for quite some time to find the correlation between points in the following type of temporal data.
Dataset: A A B A B C B C B A B C X Y X Y Z X Y X Z Y Z A B A B C B (the actual dataset has thousands of unique points)
I have to find some sort of similarity value between these points (by points I mean 'A', 'B', 'X', etc.). For example, the similarity between 'A' and 'B' in the above dataset is high, so similarity('A', 'B') should return a high value and similarity('A', 'X') should return a low value.
I am trying to calculate the similarity so that I can cluster similar points. (For the above dataset, good clusters would be {A, B, C} and {X, Y, Z}.)
So far I have tried the following approaches:
1. I tried to find segments of the data for which the value
correlation = len(segment) / unique(points in segment)
is high. If the value is high, then the same few points keep coming back and forth in that segment and are therefore highly correlated. The problem with this correlation formula is that it is not robust enough. For example, take the dataset X Y X X X X X Y X. The correlation value for the subsegment with starting index = 3 and ending index = 7 (correlation = 5/1) is higher than the correlation value of the whole dataset, i.e. index 1 to the end (correlation = 9/2). Because of such problems I have to do a lot of manual tuning, such as selecting the segment with more unique points when the correlation difference between two segments is less than some manually selected value (a tradeoff between correlation value and number of unique points), and this does not work when I have thousands of unique points.

2. Word2Vec: I ran this algorithm on the whole data as a single sentence (i.e. I merged the data into one sentence separated by spaces). I don't fully understand this algorithm, but it does not seem to work on a single sentence: it gives a similarity > 0.99 between almost all points. I don't know why it is not working, since it is supposed to learn the similarity between words based on how they occur in a sentence.
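To make the segment ratio described above concrete, here is a minimal sketch of how I compute it on the small X/Y example. The function name segment_ratio and the 1-based inclusive indexing are only my own conventions for this illustration.

```python
def segment_ratio(points, start, end):
    """Return len(segment) / number of unique points in the segment.

    start and end are 1-based, inclusive indices (assumed convention,
    chosen to match the indices used in the example above).
    """
    segment = points[start - 1:end]
    return len(segment) / len(set(segment))


data = "X Y X X X X X Y X".split()

sub_ratio = segment_ratio(data, 3, 7)             # segment X X X X X -> 5/1
whole_ratio = segment_ratio(data, 1, len(data))   # whole sequence    -> 9/2

print(sub_ratio, whole_ratio)
```

The subsegment scores higher than the whole sequence even though it contains only one distinct point, which is exactly the robustness problem I am running into.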
How should I approach this clustering problem? Can I tweak Word2Vec to fit my problem? Is there a better mathematical formula? Any other way?
Please correct me if I am doing something wrong.