Clustering of Temporal Data

Question

I have been trying for quite some time to find the correlation between points in the following type of temporal data.

Dataset: A A B A B C B C B A B C X Y X Y Z X Y X Z Y Z A B A B C B (the actual dataset has thousands of unique points)

I have to find some sort of similarity value between these points (by points I mean 'A', 'B', 'X', etc.). For example, the similarity between 'A' and 'B' in the above dataset is high, so similarity('A', 'B') should return a high value and similarity('A', 'X') should return a low value.

I am trying to calculate the similarity so that I can cluster similar points. (For the above dataset, good clusters would be {A, B, C} and {X, Y, Z}.)

So far I have tried the following approaches:

I tried to find segments of the data for which correlation = len(segment) / unique(points in segment) is high. If the value is high, then the same few points keep recurring within the segment and are therefore highly correlated. The problem with this formula is that it is not robust enough. For example, take the dataset X Y X X X X X Y X. The correlation value for the subsegment with starting index = 3 and ending index = 7 is higher (correlation = 5/1) than the correlation value of the whole dataset, i.e. index 1 to the end (correlation = 9/2). Because of such problems I have to do a lot of manual tuning, such as preferring the segment with more unique points when the correlation difference between two segments is below some manually selected value (a tradeoff between correlation value and number of unique points), and this does not work when I have thousands of unique points.
Word2Vec: I ran this algorithm on the whole dataset as a single sentence (i.e. I merged the data into a single space-separated sentence). I don't fully understand this algorithm, but it does not seem to work on a single sentence. It gives > 0.99 similarity between almost all points. I don't know why it is not working, since it is supposed to learn the similarity between words based on how they occur in a sentence.
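The weakness of the segment score from the first approach can be reproduced in a few lines. This is only a minimal sketch of the formula as described above; the function name `segment_score` and the 1-based inclusive indexing are my own choices for illustration.

```python
def segment_score(points, start, end):
    """Segment length divided by the number of distinct points in it.

    start and end are 1-based and inclusive, matching the indices
    used in the example above.
    """
    segment = points[start - 1:end]
    return len(segment) / len(set(segment))


data = "X Y X X X X X Y X".split()

# Subsegment 3..7 contains only X, so it scores 5/1.
sub_score = segment_score(data, 3, 7)
# The whole dataset contains X and Y, so it scores 9/2.
full_score = segment_score(data, 1, len(data))

print(sub_score, full_score)
```

The subsegment wins even though it contains no information about how X relates to Y, which is why the score alone cannot pick useful segments.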

How should I approach this clustering problem? Can I tweak word2vec to fit for my problem? A better mathematical formula? Any other way?

Please correct me if I am doing something wrong.

I don't think this is a clustering problem, but rather one of finding common subsequences? Word2vec clearly is nonsense to use here.
– Has QUIT--Anony-Mousse, Commented Jun 4, 2016 at 21:32
@Anony-Mousse, can you please elaborate? I don't understand why word2vec is a bad idea. Word2vec aims at doing something similar to what I am doing. As for common subsequences, I don't think exactly the same subsequence will ever occur twice (except for very short subsequences, of length < 3~4).
– Luv Agarwal, Commented Jun 5, 2016 at 9:31
word2vec is entirely appropriate here. Did you inspect the resulting embedding, and with what similarity function?
– Emre, Commented Oct 25, 2016 at 22:03

Pete · Accepted Answer · 2016-10-25 20:31:24Z

Have a look at DeepWalk. It's essentially word2vec applied to "sentences" created by random walks over a graph (nodes and edges). If you can express your data as a graph, you can apply it.
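One possible way to express the sequence from the question as a graph is sketched below. It rests on assumptions of mine, not on anything in the DeepWalk paper or the question: nodes are the unique points, an edge weight counts how often two points occur within `window` positions of each other, and walks pick neighbours in proportion to that weight (DeepWalk itself samples neighbours uniformly). The helper names `build_cooccurrence_graph` and `random_walk` are hypothetical. The resulting walks are the "sentences" you would then pass to a word2vec implementation.

```python
import random
from collections import defaultdict


def build_cooccurrence_graph(points, window=2):
    """Weighted undirected graph: graph[u][v] counts how often u and v
    appear within `window` positions of each other in the sequence."""
    graph = defaultdict(lambda: defaultdict(int))
    for i, u in enumerate(points):
        for v in points[i + 1:i + 1 + window]:
            if u != v:
                graph[u][v] += 1
                graph[v][u] += 1
    return graph


def random_walk(graph, start, length, rng):
    """One random walk of up to `length` nodes, choosing each next node
    in proportion to edge weight (an assumption; DeepWalk is uniform)."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph[walk[-1]]
        if not neighbours:
            break
        nodes = list(neighbours)
        weights = [neighbours[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


data = "A A B A B C B C B A B C X Y X Y Z X Y X Z Y Z A B A B C B".split()
graph = build_cooccurrence_graph(data, window=2)

# Five walks of length 10 from every node; these play the role of sentences.
rng = random.Random(0)
walks = [random_walk(graph, node, 10, rng)
         for node in sorted(graph) for _ in range(5)]

ab = graph["A"]["B"]         # A and B co-occur often
ax = graph["A"].get("X", 0)  # A and X never fall inside the same window
print(ab, ax)
```

Even before any embedding is trained, the edge weights already separate the two groups in this toy dataset: A and B share a heavy edge while A and X share none. The window size matters, since a larger window would start linking points across the A/B/C and X/Y/Z boundaries.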
