Questions tagged [data-leakage]
65 questions
5 votes
1 answer
69 views
Does using test data in eval_set argument for xgboost cause data leakage?
I'm using early stopping with XGBClassifier. The fitting looks like this (simplified): ...
3 votes
2 answers
144 views
Much higher scoring metrics with classification_report than cross_validate
I'm training a classifier on the DAIGT dataset. The objective is to distinguish human-written from AI-generated text, so this is a binary classification problem. As a baseline before I move on to an LLM classifier, ...
1 vote
1 answer
163 views
XGBoost CV confusion on how to choose eval set
If I am using XGBoost with GridSearchCV, how should I choose my evaluation set? Note, I am referring to eval_set within the model params. My current implementation is using GridSearchCV in order to ...
6 votes
1 answer
1k views
How high of a correlation coefficient of a feature with a target variable is considered too high?
Currently my classification model is doing too well on all of the train, validation, and test datasets. I suspect there is data leakage in the features, and therefore I've computed the ...
0 votes
1 answer
313 views
Why does time series split cause data leakage from future data?
There are lots of websites saying time series split may cause data leakage. The idea for time series splits is to divide the training set into two folds at each iteration on condition that the ...
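For readers skimming this listing, a minimal sketch of an expanding-window split may help make the idea concrete. It assumes the rows are already sorted by time; the function name `expanding_window_splits` and its parameters are hypothetical, chosen only for illustration, and this is not the implementation from any particular library. The point it shows is that every validation index comes strictly after every training index, so a model is never validated on data older than what it was trained on.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs; validation always follows training in time.

    Assumes samples are indexed 0..n_samples-1 in chronological order.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover samples at the end of the series.
        val_end = n_samples if k == n_splits else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))


splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

Shuffling the rows before splitting, or computing features from the full series before the split, would break this ordering guarantee, which is one common way future information can leak into training.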
0 votes
1 answer
79 views
Is it unethical to gather data from data leaks about demographics?
Sorry if this is the wrong SE, but in my mind it made the most sense to ask this here. My question is related to specifically collecting information about a target demographic, not individuals which ...
1 vote
1 answer
87 views
Order of preprocessing, avoiding leakage and metrics
I have a dataset with ~40k records and 16 columns (including the target) and I want to understand the correct procedure for the whole data science process. This is what I did: Performed an EDA which ...
0 votes
1 answer
100 views
Using a feature that is only available during training
I'm working on a project that aims to classify JIRA issues into their relevant owner group. An issue has the following features: Summary, Description, and Comments, all of which are text-based. During ...