I am working on evaluating an explainability method for a text classification model that predicts whether a given text sequence contains hate speech or not.
The method outputs token-level importance scores, which indicate how decisive each token was in driving the model’s final prediction; the scores are normalized so that they sum to 1. As a gold standard, human annotators have labeled tokens with binary values (important / not important). These binary labels are averaged across annotators to obtain a soft target for each token (e.g., a token labeled as important by 2 out of 3 annotators gets a score of 0.67), and the resulting vector is then normalized so that it also sums to 1.
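For concreteness, here is a minimal sketch of how I build the two vectors; the token count, attribution values, and annotator labels are all made up:

```python
import numpy as np

# Hypothetical 6-token sequence, purely to make the setup concrete.
# Raw token-level attributions from the explainer (made-up values),
# normalized so they sum to 1.
raw_attr = np.array([0.05, 0.40, 0.02, 0.03, 0.55, 0.10])
attr = raw_attr / raw_attr.sum()

# Binary "important" labels from 3 annotators (rows = annotators, columns = tokens).
annotations = np.array([
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
])
soft = annotations.mean(axis=0)   # e.g. 0.67 for a token marked by 2 of 3 annotators
human = soft / soft.sum()         # normalized so it also sums to 1
```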
I want to assess how well the model’s explanations (importance values) match the human rationales. Initially, I considered computing an AUC-type metric by sweeping a threshold over the importance scores to classify each token as "important" or not, then comparing these against the human annotations. However, I’m running into conceptual and implementation issues.
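My current attempt looks roughly like the sketch below; binarizing the human side by majority vote is my own choice, not something I found prescribed anywhere, and it is part of what feels shaky:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Same hypothetical 6-token example as above.
attr = np.array([0.05, 0.40, 0.02, 0.03, 0.55, 0.10])
attr = attr / attr.sum()                          # model importances, sum to 1
votes = np.array([0.0, 2/3, 0.0, 0.0, 1.0, 1/3])  # fraction of annotators marking each token

# Binarize the human side by majority vote and treat the continuous
# importances as the ranking score for each token.
y_true = (votes >= 0.5).astype(int)
roc_auc = roc_auc_score(y_true, attr)
pr_auc = average_precision_score(y_true, attr)
print(f"ROC AUC: {roc_auc:.3f}, PR AUC: {pr_auc:.3f}")
```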
My questions here are the following:
- Does normalizing both attribution vectors make sense? What if some attributions could be negative (i.e., the token tends to drive the model towards the opposite class)?
- Is there a standard metric for comparing these types of attributions?
- Is AUC (ROC or PR) appropriate in this case, and how should I apply it?
- Are there alternative evaluation metrics or practices commonly used in explainability or rationale-alignment tasks? Maybe some kind of distance metric between probability distributions could make sense (see the sketch just after this list)?
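For the last point, this is the kind of thing I had in mind; it obviously only works if the attributions are non-negative, which ties back to my first question:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

# Same hypothetical example: both vectors are non-negative and sum to 1,
# so they can be read as probability distributions over the tokens.
attr = np.array([0.05, 0.40, 0.02, 0.03, 0.55, 0.10])
attr = attr / attr.sum()
human = np.array([0.0, 2/3, 0.0, 0.0, 1.0, 1/3])
human = human / human.sum()

js = jensenshannon(attr, human, base=2)  # symmetric, bounded in [0, 1]
kl = entropy(human, attr)                # KL(human || model), asymmetric, unbounded
print(f"Jensen-Shannon distance: {js:.3f}")
print(f"KL(human || model): {kl:.3f}")
```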
Any guidance or examples would be greatly appreciated!