Consider the following data.
Groundtruth       | Dataset1         | Dataset2         | Dataset3
Datapoint | Time  | Datapoint | Time | Datapoint | Time | Datapoint | Time
A         | 0     | a         | 0    | a         | 0    | a         | 0
B         | 10    | b         | 5    | b         | 5    | b         | 13
C         | 15    | c         | 12   | c         | 12   | c         | 21
D         | 25    | d         | 22   | d         | 14   | d         | 30
E         | 30    | e         | 30   | e         | 17   |           |
          |       |           |      | f         | 27   |           |
          |       |           |      | g         | 30   |           |

Visualized like this (the number of - characters reflects the time between consecutive datapoints, with time increasing to the right):

Groundtruth: A|----------|B|-----|C|----------|D|-----|E
Dataset1:    a|-----|b|-------|c|----------|d|--------|e
Dataset2:    a|-----|b|-------|c|--|d|---|e|----------|f|---|g
Dataset3:    a|-------------|b|--------|c|---------|d

My goal is to compare each dataset with the groundtruth. I want to write a function that produces a similarity score between a dataset and the groundtruth, so I can evaluate how good my segmentation algorithm is. Ideally the segmentation algorithm would produce the same number of datapoints (segments) as the groundtruth, but as the datasets illustrate, that is not guaranteed, and the number of datapoints is not known ahead of time.
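For concreteness, the tables above can be written as plain lists of segment-boundary times (this list-of-times representation is my assumption about how the data would be stored):

```python
# Each series is the sorted list of segment-boundary times.
groundtruth = [0, 10, 15, 25, 30]         # A..E
dataset1    = [0, 5, 12, 22, 30]          # a..e
dataset2    = [0, 5, 12, 14, 17, 27, 30]  # a..g
dataset3    = [0, 13, 21, 30]             # a..d
```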
I've already implemented a Jaccard index to produce a basic evaluation score. But now I'm looking for an evaluation method that also punishes extra or missing datapoints, while limiting the allowed distance to a correct datapoint. That is, b doesn't have to match B exactly; it just has to be close to some correct datapoint.
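For illustration, here is one way a tolerance-based variant of the Jaccard index could look. The greedy nearest-point matching and the `tol` threshold are my assumptions, not an established formula; a sketch only:

```python
def tolerant_jaccard(gt, ds, tol=3):
    """Jaccard-style score where a dataset point counts as a hit if it
    lies within `tol` of a still-unmatched groundtruth point.
    Matching is greedy and one-to-one.  Returns a value in [0, 1]."""
    unmatched = list(gt)
    hits = 0
    for t in ds:
        # nearest groundtruth point that has not been claimed yet
        best = min(unmatched, key=lambda g: abs(g - t), default=None)
        if best is not None and abs(best - t) <= tol:
            unmatched.remove(best)  # enforce one-to-one matching
            hits += 1
    # |intersection| / |union| with tolerant matches as the intersection
    return hits / (len(gt) + len(ds) - hits)
```

With `tol=3`, Dataset1 against the groundtruth matches a→A, c→C, d→D, e→E but not b→B (distance 5), giving 4 / (5 + 5 − 4). Note that this extra/missing-point penalty falls out automatically from the denominator.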
I've tried a dynamic programming approach where I introduced a penalty for removing or adding a datapoint, as well as a distance penalty for moving to the closest datapoint. I'm struggling, though, because:

1. Each datapoint must be matched to at most one correct datapoint.
2. I need to figure out which datapoint to delete, if any.
3. I generally lack experience implementing DP algorithms.
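The requirements above (one-to-one matching, explicit add/remove decisions, distance cost) fit a Needleman-Wunsch-style sequence alignment. A minimal sketch, assuming a single `gap_penalty` for an inserted or deleted datapoint (the value 5.0 is arbitrary and would need tuning):

```python
def alignment_cost(gt, ds, gap_penalty=5.0):
    """Align two sorted lists of times with edit-distance-style DP.
    A matched pair costs the time distance between the two points;
    an unmatched point on either side costs gap_penalty.
    Lower result = more similar.  Matching is one-to-one and monotone
    (points cannot cross), which the DP recurrence enforces for free."""
    n, m = len(gt), len(ds)
    # D[i][j] = cheapest alignment of gt[:i] with ds[:j]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap_penalty          # all groundtruth points unmatched
    for j in range(1, m + 1):
        D[0][j] = j * gap_penalty          # all dataset points unmatched
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + abs(gt[i - 1] - ds[j - 1]),  # match pair
                D[i - 1][j] + gap_penalty,  # groundtruth point deleted
                D[i][j - 1] + gap_penalty,  # dataset point deleted
            )
    return D[n][m]
```

For Dataset1 against the groundtruth this matches every pair (cost 0+5+3+3+0 = 11), because skipping a pair would cost two gap penalties. Which points to delete is never decided up front: the `min` considers all three options at every cell, and a standard traceback through `D` recovers the actual matching if you need it, which directly addresses points 1 and 2 above.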
Does anyone have ideas on how to do this? If dynamic programming is the way to go, I'd appreciate link recommendations as well as some pointers on how to approach it.