
I'm trying to locally replicate the pair classification task of MMTEB/MTEB. However, I didn't find train/dev sets for all datasets in this task.

Table 2 in the original MTEB paper (Muennighoff et al., 2023) shows that there is no train data for the 3 pair classification datasets and that only SprintDuplicateQuestions has a dev set.

However, the original MTEB paper also states on page 3 that an optimal binary threshold is determined:

A pair of text inputs is provided and a label needs to be assigned. Labels are typically binary variables denoting duplicate or paraphrase pairs. The two texts are embedded and their distance is computed with various metrics (cosine similarity, dot product, euclidean distance, manhattan distance). Using the best binary threshold accuracy, average precision, f1, precision and recall are computed. The average precision score based on cosine similarity is the main metric.

So I am wondering: what data is used to determine that threshold?

Also, on page 3, Muennighoff et al. (2023) say that various distance and performance metrics are used to find the optimal cutoff value (see the quote above). But what is the exact algorithm for selecting the threshold, given that the authors apply multiple metrics?


1 Answer


MTEB does not train any model to determine the threshold. Instead, it uses average precision as the main metric, which does not require a threshold since it

summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold [...].
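To illustrate (with made-up toy scores and labels, not actual MTEB data), average precision can be computed directly from cosine similarities with scikit-learn, without ever fixing a cutoff:

import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical cosine similarities between text pairs and their
# binary duplicate labels (toy values, not from MTEB).
labels = np.array([1, 0, 1, 1, 0, 0])
cosine_scores = np.array([0.91, 0.75, 0.72, 0.80, 0.48, 0.35])

# Average precision summarizes the precision-recall curve over all
# possible thresholds, so no single cutoff has to be chosen.
ap = average_precision_score(labels, cosine_scores)
print(f"Average precision: {ap:.3f}")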

However, Muennighoff et al. (2023) do indeed report the other metrics (accuracy, F1, precision, and recall) based on the best binary threshold. These thresholds are determined by iteratively searching over the test data, as can be seen in their GitHub repository.

This method, for example, is used for accuracy:

import numpy as np

def find_best_acc_and_threshold(scores, labels, high_score_more_similar: bool):
    assert len(scores) == len(labels)
    rows = list(zip(scores, labels))

    # Sort pairs so that the most similar (or least distant) pairs come first.
    rows = sorted(rows, key=lambda x: x[0], reverse=high_score_more_similar)

    max_acc = 0
    best_threshold = -1

    positive_so_far = 0
    remaining_negatives = sum(np.array(labels) == 0)

    # Sweep over all candidate cut points between consecutive scores.
    for i in range(len(rows) - 1):
        score, label = rows[i]
        if label == 1:
            positive_so_far += 1
        else:
            remaining_negatives -= 1

        # Accuracy if the threshold were placed right after position i.
        acc = (positive_so_far + remaining_negatives) / len(labels)
        if acc > max_acc:
            max_acc = acc
            best_threshold = (rows[i][0] + rows[i + 1][0]) / 2

    return max_acc, best_threshold
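As a rough usage sketch (again with hypothetical scores and labels), the search above would be applied to cosine similarities like this:

scores = np.array([0.91, 0.75, 0.72, 0.80, 0.48, 0.35])
labels = np.array([1, 0, 1, 1, 0, 0])

# Higher cosine similarity should indicate a duplicate pair, hence
# high_score_more_similar=True; for distance metrics it would be False.
max_acc, best_threshold = find_best_acc_and_threshold(
    scores, labels, high_score_more_similar=True
)
print(f"Best accuracy: {max_acc:.3f} at threshold {best_threshold:.3f}")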

The quote from the paper you provided is hence somewhat inaccurate: the average precision is not found by searching for the best binary threshold. Rather, it summarizes the precision-recall curve, i.e. all possible thresholds.

The other metrics (accuracy, F1, precision, and recall) should be taken with some caution, since their thresholds are found by searching the test data rather than being determined on train data.

