I'm training a classifier on the DAIGT dataset. The objective is to differentiate human-written from AI-generated text, so this is a binary classification problem. As a baseline, before I move on to an LLM-based classifier, I'm using a pipeline of a TF-IDF vectorizer followed by a logistic regression classifier. However, when I classify the data this way I get extremely high metrics. For example, the following code:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, test_idx in kf.split(daigt_v2["text"], daigt_v2["label"]):
    X_train, y_train = daigt_v2.iloc[train_idx]["text"], daigt_v2.iloc[train_idx]["label"]
    X_test, y_test = daigt_v2.iloc[test_idx]["text"], daigt_v2.iloc[test_idx]["label"]

    baseline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', LogisticRegression())
    ])

    baseline.fit(X_train, y_train)
    y_pred = baseline.predict(X_test)
    print(classification_report(y_test, y_pred, target_names=["Human", "AI"]))
```

gives the following output:
```
              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5475
          AI       1.00      0.98      0.99      3499

    accuracy                           0.99      8974
   macro avg       0.99      0.99      0.99      8974
weighted avg       0.99      0.99      0.99      8974

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       0.99      0.98      0.99      3500

    accuracy                           0.99      8974
   macro avg       0.99      0.99      0.99      8974
weighted avg       0.99      0.99      0.99      8974

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       1.00      0.98      0.99      3500

    accuracy                           0.99      8974
   macro avg       0.99      0.99      0.99      8974
weighted avg       0.99      0.99      0.99      8974

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       0.99      0.98      0.99      3499

    accuracy                           0.99      8973
   macro avg       0.99      0.99      0.99      8973
weighted avg       0.99      0.99      0.99      8973

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       1.00      0.98      0.99      3499

    accuracy                           0.99      8973
   macro avg       0.99      0.99      0.99      8973
weighted avg       0.99      0.99      0.99      8973
```

So we see a 0.99 F1 score and 0.99 classification accuracy on every fold, which obviously seems way too high. However, when I try using `cross_validate` like this:
```python
import numpy as np

baseline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression())
])

scores = cross_validate(
    baseline,
    daigt_v2["text"],
    daigt_v2["label"],
    cv=10,
    scoring=["accuracy", "f1", "recall", "precision", "roc_auc", "average_precision"]
)

summary = {key: float(np.mean(value)) for key, value in scores.items()}
summary
```

the summary comes back as:
```python
{'fit_time': 13.48662896156311,
 'score_time': 5.418254947662353,
 'test_accuracy': 0.8590308329341341,
 'test_f1': 0.8367589483608666,
 'test_recall': 0.9277524353897032,
 'test_precision': 0.7674348038361346,
 'test_roc_auc': 0.9595275583634191,
 'test_average_precision': 0.9446004784576681}
```

These are much more modest scores. Obviously I trust the second result more, but can anyone explain the discrepancy here?
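One difference I can see between the two runs is the fold construction itself: my loop uses a shuffled `StratifiedKFold` with 5 splits, while passing the integer `cv=10` to `cross_validate` makes scikit-learn fall back to an *unshuffled* `StratifiedKFold` for a classifier, if I'm reading the docs right. Here's a minimal sketch of how I could rule that out, reusing the same `baseline` pipeline and `daigt_v2` DataFrame from above and handing the exact same splitter to `cross_validate`:

```python
from sklearn.model_selection import StratifiedKFold, cross_validate

# Same splitter as the manual loop: 5 shuffled, stratified folds.
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Passing the splitter object instead of cv=10 means cross_validate
# evaluates on exactly the same folds as the manual loop.
scores = cross_validate(
    baseline,
    daigt_v2["text"],
    daigt_v2["label"],
    cv=kf,
    scoring=["accuracy", "f1"],
)
print(scores["test_accuracy"].mean(), scores["test_f1"].mean())
```

If this reproduces the ~0.99 numbers, the gap would come from shuffling (and 5 vs. 10 splits) rather than from anything the manual loop is doing differently.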
**Comments:**

- Note that the first approach uses `n_splits=5`, and the second one uses `cv=10`.
- Try `baseline.fit(X_train.to_numpy(), y_train)`. `X_train.to_numpy()` should contain just the text and no other information that might be revealing the label. Wondering if there could be data leakage of some sort.
- (OP) I tried `X_train.to_numpy()` but it hasn't changed anything.
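Following up on the data-leakage suggestion in the comments, one quick probe (a sketch, assuming the `text` and `label` columns from the snippets above) would be to check whether the same essay appears more than once in `daigt_v2`, since with shuffled folds any duplicated text can land in both train and test and inflate the scores:

```python
# Count exact duplicate texts; with shuffled folds, copies of the same
# essay can end up in both train and test, which inflates the metrics.
n_total = len(daigt_v2)
n_unique = daigt_v2["text"].nunique()
print(f"{n_total - n_unique} duplicate texts out of {n_total}")

# Break duplicates down by label in case they cluster in one class.
dupes = daigt_v2[daigt_v2.duplicated(subset="text", keep=False)]
print(dupes["label"].value_counts())
```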