3
$\begingroup$

I'm training a classifier on the DAIGT dataset. The objective is to differentiate human-written from AI-generated text, so this is a binary classification problem. As a baseline, before moving on to an LLM classifier, I'm using a pipeline of a TF-IDF vectorizer followed by a logistic regression classifier. However, when I classify the data this way I get extremely high scoring metrics. For example, the following code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(daigt_v2["text"], daigt_v2["label"]):
    X_train, y_train = daigt_v2.iloc[train_idx]["text"], daigt_v2.iloc[train_idx]["label"]
    X_test, y_test = daigt_v2.iloc[test_idx]["text"], daigt_v2.iloc[test_idx]["label"]

    baseline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', LogisticRegression())
    ])
    baseline.fit(X_train, y_train)
    y_pred = baseline.predict(X_test)
    print(classification_report(y_test, y_pred, target_names=["Human", "AI"]))

gives the following output:

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5475
          AI       1.00      0.98      0.99      3499

    accuracy                           0.99      8974
   macro avg       0.99      0.99      0.99      8974
weighted avg       0.99      0.99      0.99      8974

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       0.99      0.98      0.99      3500

    accuracy                           0.99      8974
   macro avg       0.99      0.99      0.99      8974
weighted avg       0.99      0.99      0.99      8974

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       1.00      0.98      0.99      3500

    accuracy                           0.99      8974
   macro avg       0.99      0.99      0.99      8974
weighted avg       0.99      0.99      0.99      8974

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       0.99      0.98      0.99      3499

    accuracy                           0.99      8973
   macro avg       0.99      0.99      0.99      8973
weighted avg       0.99      0.99      0.99      8973

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       1.00      0.98      0.99      3499

    accuracy                           0.99      8973
   macro avg       0.99      0.99      0.99      8973
weighted avg       0.99      0.99      0.99      8973

So we see a 0.99 F1 score and 0.99 classification accuracy in every fold, which obviously seems way too high. However, when I try using cross_validate like this:

import numpy as np

baseline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression())
])
scores = cross_validate(
    baseline,
    daigt_v2["text"],
    daigt_v2["label"],
    cv=10,
    scoring=["accuracy", "f1", "recall", "precision", "roc_auc", "average_precision"]
)
summary = {key: float(np.mean(value)) for key, value in scores.items()}

then summary comes back as:

{'fit_time': 13.48662896156311,
 'score_time': 5.418254947662353,
 'test_accuracy': 0.8590308329341341,
 'test_f1': 0.8367589483608666,
 'test_recall': 0.9277524353897032,
 'test_precision': 0.7674348038361346,
 'test_roc_auc': 0.9595275583634191,
 'test_average_precision': 0.9446004784576681}

These are much more modest scores. Obviously I trust the second result more, but can anyone explain the discrepancy?

$\endgroup$
4
  • $\begingroup$ Does a similar thing happen when you use the same number of folds in each? Currently the first one uses n_splits=5, and the second one uses cv=10. $\endgroup$ Commented May 31 at 12:21
  • 1
    $\begingroup$ Ah, I forgot to change that back, but the answer is yes. $\endgroup$ Commented May 31 at 12:34
  • $\begingroup$ In the top version, try changing the line to baseline.fit(X_train.to_numpy(), y_train). X_train.to_numpy() should contain just the text and no other information that might reveal the label. I'm wondering if there could be data leakage of some sort. $\endgroup$ Commented May 31 at 15:16
  • 1
    $\begingroup$ I added X_train.to_numpy() but it hasn't changed anything. $\endgroup$ Commented May 31 at 15:23

2 Answers

3
$\begingroup$

When you supply cv=<int> to cross_validate(), it uses a splitting regime without shuffling (shuffle=False by default). Since the rows of your dataset are ordered by class (roughly the first half is label=0, followed by label=1), the model gets trained on label=0 data before being tested on label=1, which skews the results.
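
If you want to see this default for yourself, scikit-learn exposes check_cv, the helper that cross_validate uses internally to turn an integer cv into a splitter. A minimal sketch, assuming daigt_v2 is the DataFrame from the question:

from sklearn.model_selection import check_cv

# Resolve cv=10 the same way cross_validate does for a classifier
cv = check_cv(10, daigt_v2["label"], classifier=True)
print(cv)   # e.g. StratifiedKFold(n_splits=10, random_state=None, shuffle=False)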

One solution is to define a splitter, and use it for both of your code snippets:

# Define a splitter for all CV analyses
splitter = StratifiedKFold(5, shuffle=True, random_state=0)
...
... = cross_validate(..., cv=splitter)

Note that random_state=0 ensures it randomises the same way on each call.
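
Applied to the second snippet from the question, that looks roughly like this (a sketch; everything except the cv argument is unchanged):

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    baseline,
    daigt_v2["text"],
    daigt_v2["label"],
    cv=splitter,   # explicit shuffled splitter instead of cv=10
    scoring=["accuracy", "f1", "recall", "precision", "roc_auc", "average_precision"]
)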

You could alternatively shuffle your data upon loading, which then permits a non-randomising splitter as in cross_validate(..., cv=5).
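
For example, a minimal sketch of the shuffle-on-load approach, assuming daigt_v2 is a pandas DataFrame:

# Shuffle the rows once after loading, then a non-shuffling cv=5 is fine
daigt_v2 = daigt_v2.sample(frac=1, random_state=0).reset_index(drop=True)
scores = cross_validate(baseline, daigt_v2["text"], daigt_v2["label"], cv=5,
                        scoring=["accuracy", "f1", "roc_auc"])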

$\endgroup$
4
  • 1
    $\begingroup$ The default for classification is a StratifiedKFold, so I don't think this should be the issue (each fold will have similar target rates)? $\endgroup$ Commented May 31 at 17:08
  • 2
    $\begingroup$ Excellent detective work! This was it. Making your changes resulted in the cross_validate metrics rising to match the classification_report metrics $\endgroup$ Commented May 31 at 17:08
  • $\begingroup$ Glad it worked @saladmobster. Interesting point @BenReiniger. I suppose with shuffle=False we're forcing the stratification upon contiguous blocks, which becomes ineffective? $\endgroup$ Commented May 31 at 17:16
  • 2
    $\begingroup$ My working theory is that StratifiedKFold with shuffle=False grabs both values of the target class in order, meaning that, for example, entire prompts or AI-model types will be excluded from the training/test data. $\endgroup$ Commented May 31 at 18:04
-2
$\begingroup$

I do not think your process is wrong; rather, I believe there is a flaw in how Python computes the F1 value in its confusion-matrix calculation. I raised the same issue around 5 years ago and hoped it would have been resolved by now, but seeing these results I think the bug still exists in the software.

Let me explain why I think so. I developed a logit model for a business problem, and Python reported its accuracy as 100%. Being a statistician by education, the moment I saw 100% I felt I must be missing something, so I checked the predictions against the actuals myself and got only 96% accuracy. Digging further, I found that instead of comparing the results and the actuals row by row, it was comparing their column totals.

For example, with actuals [1, 0, 1, 0] and predicted results [0, 0, 1, 1], comparing the two element by element gives 50% accuracy, but both vectors sum to 2, so the difference of the totals is 0 and the Python code reports the accuracy as 100%.

I hope I have explained this clearly. I may be wrong in my understanding of how Python does the calculation, but this is what I understood when reading the code of the logit model. Brickbats on my understanding are welcome :-)

$\endgroup$
