Questions tagged [data-leakage]

0 votes
0 answers
26 views

I’m building a regression model that predicts the final number of vehicles booked for a ferry trip. Each training row represents the state of bookings for a given trip N days before departure. Example ...
vpvinc's user avatar
  • 1
1 vote
0 answers
22 views

The cross-validation function cv.glmnet, for regularized regression, does not seem to allow for separate transformation/preprocessing of training and validation ...
DriesB's user avatar
  • 11
2 votes
0 answers
67 views

I’m working with a dataset of streetlights, where each row represents a streetlight. Each streetlight has a type (LED, Incandescent, Unknown), an address, and a street name. I am trying to predict ...
setty's user avatar
  • 161
0 votes
0 answers
64 views

This topic has been discussed before, but I couldn't find a specific answer. Here's my approach to forecasting QoQ values: run the usual LASSO K-fold CV on time-series data and generate a one-step-ahead ...
bebgejo's user avatar
0 votes
0 answers
41 views

Data leakage, e.g. calculating the mean and standard deviation before splitting the data into train and test sets, can lead to overestimating the performance of the predictive model. This can be understood ...
Antonios Sarikas's user avatar
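
The kind of leakage described in the excerpt above can be illustrated with a minimal sketch, assuming a toy one-dimensional feature and a fixed split. The data values and the helper names `fit_standardizer` and `standardize` are hypothetical and exist only for illustration. The scaling statistics are estimated from the training portion alone and then reused for the test portion.

```python
import statistics

def fit_standardizer(values):
    """Estimate mean and (population) standard deviation from the given values only."""
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Scale values with statistics that were estimated elsewhere."""
    return [(v - mean) / std for v in values]

# Made-up feature values; the last point plays the role of the test set.
data = [2.0, 4.0, 6.0, 8.0, 100.0]
train, test = data[:4], data[4:]

# Leak-free: statistics come from the training data only.
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)

# Leaky variant for comparison: statistics computed before splitting,
# so the test point influences how the training data is scaled.
leaky_mean, leaky_std = fit_standardizer(data)
```

In this toy case the leaky mean is pulled toward the held-out point, which is one way test-set information can end up in the training features.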
0 votes
1 answer
101 views

I have a time series that I need to forecast with a SARIMA model. I do a train test split and then fit the SARIMA model on just the training data. I want to avoid data leakage and preserve realism, so ...
Robertmg's user avatar
  • 143
0 votes
0 answers
73 views

Assume that a random variable $y_{i,t}$ is governed by some linear factors $x_{t,j}$ and a random noise term $\epsilon_{i,t}$: $$ y_{i,t} = \sum_{j}^{M+1}\beta_{j,i}x_{t,j} + \epsilon_{i,t} $$ Written ...
deblue's user avatar
  • 399
0 votes
0 answers
51 views

I have an ML model that has been built from data that is not representative of the population class frequencies. The majority class is actually undersampled, and so it is more frequent in the population ...
HaplessEcologist's user avatar
1 vote
1 answer
178 views

Assume I use a moving window to slice a history of daily stock closing prices, using the past 7 days to predict the next day. For each training instance, I'm strictly using historical data to predict future ...
yang's user avatar
  • 149
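
The moving-window setup in the excerpt above might look roughly like the following sketch. The price series is made up, the helper name `make_windows` is hypothetical, and the 80/20 chronological split is an assumption rather than anything stated in the question. It only shows how the samples are built and does not settle whether the setup leaks.

```python
def make_windows(prices, window=7):
    """Pair each run of `window` consecutive prices with the price that follows it."""
    samples = []
    for end in range(window, len(prices)):
        features = prices[end - window:end]  # the past `window` days
        target = prices[end]                 # the next day
        samples.append((features, target))
    return samples

# 20 made-up daily closing prices.
prices = [float(p) for p in range(100, 120)]
samples = make_windows(prices, window=7)

# Chronological split (assumed 80/20): every training target precedes
# every test target in time, so no shuffling is applied.
split = int(len(samples) * 0.8)
train_samples, test_samples = samples[:split], samples[split:]
```

Note that the feature windows of the earliest test samples overlap with days that also appear in training samples; each individual sample still uses only past values to predict its own target.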
3 votes
1 answer
242 views

I am working on a classification problem involving repeated measures. My objective is to classify positive patients as early as possible. In my practical application scenario, once the target becomes ...
jpsca1293's user avatar
1 vote
0 answers
95 views

I've seen it generally recommended, when using a train-test-validation data split, to first split your data into train and test datasets, and then further split the train dataset into a train and ...
sammcm998's user avatar
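
The two-stage split mentioned in the excerpt above could be sketched as follows, assuming i.i.d. rows that may be shuffled. The function name `split_indices`, the fractions, and the seed are all hypothetical choices made for illustration.

```python
import random

def split_indices(n, test_frac=0.2, val_frac=0.2, seed=0):
    """Split row indices into train/validation/test in two stages."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)

    # Stage 1: set the test rows aside first.
    n_test = int(n * test_frac)
    test, rest = idx[:n_test], idx[n_test:]

    # Stage 2: carve the validation rows out of what remains.
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100)
```

With these assumed fractions, 100 rows yield 64 training, 16 validation, and 20 test indices, and the three sets do not overlap.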
0 votes
0 answers
151 views

In my company I've been noticing some binary classification modeling code that replaces bins of a continuous variable with the corresponding Weight of Evidence (WoE) of the given bin. As far as I ...
jglad's user avatar
  • 33
2 votes
1 answer
278 views

How to do EDA and model selection for time series forecasting without data leakage? I'm assuming just checking for missing values is OK. But is graphing the entire time series considered data leakage? ...
pandashelp's user avatar
1 vote
0 answers
102 views

I classify pairs of entities, let's say dog-cat pairs, by whether there is an association between them (positive class) or not (negative class). I have a moderately sized positive dataset (~130k ...
oliver.c's user avatar
  • 185
2 votes
0 answers
43 views

A while ago I came across the term "data leakage" for the first time, and after some research, I found that it is a common mistake among data science/machine learning practitioners. But the ...
jairiidriss's user avatar