Questions tagged [data-leakage]
The data-leakage tag has no summary.
79 questions
0 votes
0 answers
26 views
Time-based regression: is it leakage if training includes snapshots closer to the event than those used at prediction?
I’m building a regression model that predicts the final number of vehicles booked for a ferry trip. Each training row represents the state of bookings for a given trip N days before departure. Example ...
1 vote
0 answers
22 views
How to separate transformation/preprocessing of training and validation datasets in glmnet? [closed]
The cross-validation function cv.glmnet, for regularized regression, does not seem to allow for separate transformation/preprocessing of training and validation ...
2 votes
0 answers
67 views
Preventing data leakage when using street-level aggregated features in classification
I’m working with a dataset of streetlights, where each row represents a streetlight. Each streetlight has a type (LED, Incandescent, Unknown), an address, and a street name. I am trying to predict ...
0 votes
0 answers
64 views
Time series LASSO K-fold cross validation
This topic has been discussed before but I couldn't find a specific answer. Here's my approach to forecasting QoQ values: run the usual LASSO K-fold CV on time-series data and generate a one-step ahead ...
0 votes
0 answers
41 views
How to show that data leakage leads to optimism?
Data leakage, e.g. calculating the mean and standard deviation before splitting the data into train and test sets, can lead to overestimating the performance of the predictive model. This can be understood ...
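As a rough illustration of the preprocessing leak this excerpt mentions, the sketch below contrasts scaling statistics computed on all rows with statistics computed on the training rows only. The toy data, the 80/20 split, and the helper name `standardize` are assumptions made for illustration, not part of the question.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: one numeric feature, 100 observations.
values = [random.gauss(50.0, 10.0) for _ in range(100)]

# Split first, before any statistics are computed.
train, test = values[:80], values[80:]

# Leaky version: the scaling statistics also see the test rows.
leaky_mean = statistics.fmean(values)
leaky_std = statistics.pstdev(values)

# Leak-free version: the statistics come from the training rows only.
train_mean = statistics.fmean(train)
train_std = statistics.pstdev(train)


def standardize(xs, mean, std):
    """Scale xs with the supplied mean and standard deviation."""
    return [(x - mean) / std for x in xs]


train_scaled = standardize(train, train_mean, train_std)
# The test rows reuse the training statistics; they never influence them.
test_scaled = standardize(test, train_mean, train_std)
```

The sketch only shows where the statistics come from. It does not by itself measure how much optimism the leaky variant introduces.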
0 votes
1 answer
101 views
Data Leakage with STL Decomposition
I have a time series that I need to forecast with a SARIMA model. I do a train test split and then fit the SARIMA model on just the training data. I want to avoid data leakage and preserve realism, so ...
0 votes
0 answers
73 views
Rolling PCA for time-series regression: information leakage
Assume that a random variable $y_{i,t}$ is governed by some linear factors $x_{t,j}$ and a random noise term $\epsilon_{i,t}$: $$ y_{i,t} = \sum_{j}^{M+1}\beta_{j,i}x_{t,j} + \epsilon_{i,t} $$ Written ...
0 votes
0 answers
51 views
Use training or testing set when calculating sample weights for evaluation?
I have an ML model that has been built from data that is not representative of the population class frequencies. The majority class is actually undersampled, and so it is more frequent in the population ...
1 vote
1 answer
178 views
How does temporal data leakage happen?
Assume I use a moving window to slice a daily stock closing price history data. Using past 7 days to predict next day. For each training instance, I'm strictly using historical data to predict future ...
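The moving-window setup in this excerpt can be sketched as follows, assuming a 7-day window, a made-up price series, and a chronological 80/20 split. The function name `make_windows` is hypothetical. The point of the sketch is that a time-ordered split keeps every training target earlier than every test target, which a random shuffle of the windows would not guarantee.

```python
def make_windows(prices, window=7):
    """Return (features, target) pairs: `window` past values -> next value."""
    pairs = []
    for end in range(window, len(prices)):
        pairs.append((prices[end - window:end], prices[end]))
    return pairs


# Hypothetical daily closing prices (a simple increasing series).
prices = [100.0 + day for day in range(30)]

pairs = make_windows(prices, window=7)

# Chronological split: the earliest 80% of windows train, the rest test.
cut = int(len(pairs) * 0.8)
train_pairs, test_pairs = pairs[:cut], pairs[cut:]

# In this toy series the price equals 100 + day, so comparing targets
# compares their dates: all training targets precede all test targets.
latest_train_target = max(target for _, target in train_pairs)
earliest_test_target = min(target for _, target in test_pairs)
```

Test windows still contain values that served as training targets. Those values lie in the past at prediction time, so this overlap is not in itself a leak.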
3 votes
1 answer
242 views
Best Practices for Splitting Data in a Repeated Measures Classification Problem
I am working on a classification problem involving repeated measures. My objective is to classify positive patients as early as possible. In my practical application scenario, once the target becomes ...
1 vote
0 answers
95 views
How should you split up data in a train-test-validation split
I've seen it is generally recommended when using a train-test-validation data split, to first split your data into train and test datasets, and then further split the train dataset into a train and ...
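The two-stage split described in this excerpt could look like the minimal sketch below. It assumes row indices are shuffled once, the test set is carved off first, and the validation set is then taken from the remaining rows. The function name `split_indices` and the 20% fractions are illustrative choices only.

```python
import random


def split_indices(n, test_frac=0.2, val_frac=0.2, seed=0):
    """Split range(n) into train, validation, and test index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)

    # Stage 1: hold out the test set.
    n_test = int(n * test_frac)
    test, rest = idx[:n_test], idx[n_test:]

    # Stage 2: carve the validation set out of what remains.
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
```

With 100 rows and these fractions the result is 64 training, 16 validation, and 20 test indices, with no index shared between the three lists.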
0 votes
0 answers
151 views
Does replacing binned variables with Weight of Evidence values introduce data leakage?
In my company I've been noticing some binary classification modeling code that replaces bins of a continuous variable with the corresponding Weight of Evidence (WoE) of the given bin. As far as I ...
2 votes
1 answer
278 views
EDA and Model Selection for Forecasting while avoiding Data Leakage
How to do EDA and model selection for time series forecasting without data leakage? I'm assuming just checking for missing values is OK. But is graphing the entire time series considered data leakage? ...
1 vote
0 answers
102 views
Should I delete samples from the training data that are present in the testing data by accident?
I classify pairs of entities, let's say dog-cat pairs, whether there is association between them (positive class) or there is not (negative class). I have a moderately sized positive dataset (~130k ...
2 votes
0 answers
43 views
Data leakage: Train test split before or after data preprocessing? [duplicate]
A while ago I came across the term "data leakage" for the first time, and after some research, I found that it is a common mistake among data science/machine learning practitioners. But the ...