[Feature Request] Overlapping Ratio Threshold support #81

@aflueckiger

Overlapping Ratio

Currently, the result of find_overlap is truthy as soon as the two ranges share a single position.

Line 330 in df0e695

    if find_overlap(true_range, pred_range) and true not in true_which_overlapped_with_pred:
 

nervaluate/src/nervaluate/utils.py

Lines 85 to 104 in df0e695

    def find_overlap(true_range: range, pred_range: range) -> set:
        """Find the overlap between two ranges

        Find the overlap between two ranges. Return the overlapping values if
        present, else return an empty set().

        Examples:

        >>> find_overlap((1, 2), (2, 3))
        2
        >>> find_overlap((1, 2), (3, 4))
        set()
        """

        true_set = set(true_range)
        pred_set = set(pred_range)

        overlaps = true_set.intersection(pred_set)

        return overlaps
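
To make the behavior concrete, here is a minimal sketch that re-implements the quoted helper in one line and feeds it two hypothetical spans (the numbers are made up for illustration). A prediction that touches only the last position of a long entity still yields a non-empty set, which the `if` check treats as an overlap.

```python
def find_overlap(true_range: range, pred_range: range) -> set:
    """Condensed copy of the quoted helper: positions shared by both ranges."""
    return set(true_range).intersection(set(pred_range))


# Hypothetical spans: a 100-position entity and a prediction that
# shares only its very last position.
true_range = range(0, 100)
pred_range = range(99, 200)

shared = find_overlap(true_range, pred_range)

# A non-empty set is truthy, so this pair counts as overlapping.
is_overlap = bool(shared)
```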
 

However, in most cases an overlap ratio threshold would be preferable, so that a pair only counts as overlapping when the ratio of intersection to union exceeds the threshold.
Something like this:

    pred = {'start': 10, 'end': 15}
    label = {'start': 12, 'end': 18}
    # calculate union and intersection
    union = {'start': 10, 'end': 18}
    intersection = {'start': 12, 'end': 15}
    # calculate ratio
    ratio = (15 - 12) / (18 - 10)
    return ratio > threshold

The current find_overlap uses set operations to find overlaps, which seems inefficient, since it builds a set of every position in both ranges. The overlap could instead be computed directly from the start and end values.

Here's my implementation:

    def find_overlap(self, true: dict[str, int | str], pred: dict[str, int | str]) -> bool:
        start_max = max(true['start'], pred['start'])
        end_min = min(true['end'], pred['end'])
        if start_max >= end_min:
            return False
        start_min = min(true['start'], pred['start'])
        end_max = max(true['end'], pred['end'])
        overlap_ratio = (end_min - start_max) / (end_max - start_min)
        return overlap_ratio > self.overlap_ratio_threshold
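
For anyone who wants to try the idea outside the evaluator class, the same logic can be sketched as two standalone functions. The names `overlap_ratio` and `exceeds_threshold` and the default threshold of 0.5 are placeholders chosen for this example, not part of nervaluate. The spans are treated as half-open offsets, as in the proposal above.

```python
def overlap_ratio(true: dict, pred: dict) -> float:
    """Intersection length divided by union length of two half-open spans."""
    start_max = max(true["start"], pred["start"])
    end_min = min(true["end"], pred["end"])
    if start_max >= end_min:
        # Disjoint or merely touching spans have no overlap.
        return 0.0
    start_min = min(true["start"], pred["start"])
    end_max = max(true["end"], pred["end"])
    return (end_min - start_max) / (end_max - start_min)


def exceeds_threshold(true: dict, pred: dict, threshold: float = 0.5) -> bool:
    """Hypothetical wrapper: True only if the ratio is strictly above the threshold."""
    return overlap_ratio(true, pred) > threshold


# The spans from the example above: intersection 12..15, union 10..18.
label = {"start": 12, "end": 18}
pred = {"start": 10, "end": 15}
ratio = overlap_ratio(label, pred)  # (15 - 12) / (18 - 10) = 0.375
```

With these numbers the pair passes a threshold of 0.3 but fails the placeholder default of 0.5, which is the kind of control the current any-overlap check does not offer.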

Last Character excluded

I wonder why the last token is included, which is very counterintuitive. This comes from #32. Maybe @aflueckiger could explain the reasoning? Does the end offset in your data include the last character?

I think for most data, start and end are offsets into the original text string, i.e. text[start:end], which means the last character is excluded. text[1:3] and text[3:5] do not overlap at all.

nervaluate/src/nervaluate/evaluate.py

Lines 294 to 296 in df0e695

    # overlapping needs to take into account last token as well
    pred_range = range(pred["start"], pred["end"] + 1)
    true_range = range(true["start"], true["end"] + 1)
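
A small sketch of the consequence, assuming spans are half-open string offsets (the sample text and spans are invented for illustration): two adjacent slices share no characters, yet building the ranges with `+ 1` makes them share one position.

```python
text = "abcdefgh"
first = {"start": 1, "end": 3}   # text[1:3] == "bc"
second = {"start": 3, "end": 5}  # text[3:5] == "de"

# Half-open ranges, matching Python slicing: no shared position.
half_open = set(range(first["start"], first["end"])) & set(
    range(second["start"], second["end"])
)

# Ranges extended by one, as in the quoted lines: position 3 is shared.
inclusive = set(range(first["start"], first["end"] + 1)) & set(
    range(second["start"], second["end"] + 1)
)
```

Here `half_open` is empty while `inclusive` contains position 3, so the two adjacent spans would be reported as overlapping.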
 

Any support for huggingface Evaluate?

Would the maintainers consider following the huggingface Evaluate standard? That would mean inheriting from evaluate.Metric and pushing the metric to the huggingface hub. Afterwards, users could directly call metric = evaluate.load('{hub_url}').

Example: https://huggingface.co/spaces/evaluate-metric/glue/blob/main/glue.py
