Questions tagged [data-leakage]

Question 1

I’m building a regression model that predicts the final number of vehicles booked for a ferry trip. Each training row represents the state of bookings for a given trip N days before departure. Example ...

Question 2

The cross-validation function cv.glmnet, for regularized regression, does not seem to allow for separate transformation/preprocessing of training and validation ...

Question 3

I’m working with a dataset of streetlights, where each row represents a streetlight. Each streetlight has a type (LED, Incandescent, Unknown), an address, and a street name. I am trying to predict ...

Question 4

This topic has been discussed before, but I couldn't find a specific answer. Here's my approach to forecasting QoQ values: run the usual LASSO K-fold CV on time series data and generate a one-step-ahead ...

Question 5

Data leakage, e.g. calculating mean and standard deviation before data splitting into train and test sets, can lead to overestimating the performance of the predictive model. This can be understood ...

Question 6

I have a time series that I need to forecast with a SARIMA model. I do a train test split and then fit the SARIMA model on just the training data. I want to avoid data leakage and preserve realism, so ...

Question 7

Assume that a random variable $y_{i,t}$ is governed by some linear factors $x_{t,j}$ and a random noise term $\epsilon_{i,t}$: $$ y_{i,t} = \sum_{j}^{M+1}\beta_{j,i}x_{t,j} + \epsilon_{i,t} $$ Written ...

Question 8

I have an ML model that has been built from data that is not representative of the population class frequencies. The majority class is actually undersampled, and so it is more frequent in the population ...

Question 9

Assume I use a moving window to slice a history of daily stock closing prices, using the past 7 days to predict the next day. For each training instance, I'm strictly using historical data to predict future ...

Question 10

I am working on a classification problem involving repeated measures. My objective is to classify positive patients as early as possible. In my practical application scenario, once the target becomes ...

Question 11

I've seen that it is generally recommended, when using a train-test-validation data split, to first split your data into train and test datasets, and then further split the train dataset into a train and ...

Question 12

In my company I've been noticing some binary classification modeling code that replaces bins of a continuous variable with the corresponding Weight of Evidence (WoE) of the given bin. As far as I ...

Question 13

How to do EDA and model selection for time series forecasting without data leakage? I'm assuming just checking for missing values is OK. But is graphing the entire time series considered data leakage? ...

Question 14

I classify pairs of entities, let's say dog-cat pairs, by whether there is an association between them (positive class) or not (negative class). I have a moderately sized positive dataset (~130k ...

Question 15

A while ago I came across the term "data leakage" for the first time, and after some research, I found that it is a common mistake among data science/machine learning practitioners. But the ...

Time-based regression: is it leakage if training includes snapshots closer to the event than those used at prediction?

How to separate transformation/preprocessing of training and validation datasets in glmnet? [closed]

Preventing data leakage when using street-level aggregated features in classification

Time series LASSO K-fold cross validation

How to show that data leakage leads to optimism?

Data Leakage with STL Decomposition

Rolling PCA for time-series regression: information leakage

Use training or testing set when calculating sample weights for evaluation?

How does temporal data leakage happen?

Best Practices for Splitting Data in a Repeated Measures Classification Problem

How should you split up data in a train-test-validation split

Does replacing binned variables with Weight of Evidence values introduce data leakage?

EDA and Model Selection for Forecasting while avoiding Data Leakage

Should I delete samples from the training data that are present in the testing data by accident?

Data leakage: Train test split before or after data preprocessing? [duplicate]
