Questions tagged [train-test-split]
The train-test split is a method for estimating the performance of machine learning algorithms on prediction tasks: the data is divided into a set used to fit the model and a held-out set used to evaluate it.
128 questions
3 votes
2 answers
67 views
Should the minimum and maximum of each feature be contained in the train set for machine learning?
When using machine learning algorithms for regression, I know that the prediction of the final model will be best when the features are within the ranges used for training, to avoid extrapolation. ...
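For context, a minimal sketch (with toy data of my own) of checking whether a random split leaves every feature's test-time range inside the training range:

```python
# Minimal sketch: check whether the training split covers the full range of each
# feature, so the model never has to extrapolate on the test set.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])  # toy data
y = X.sum(axis=1) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# A feature is "covered" if its test values stay inside the training min/max.
covered = (X_test.min() >= X_train.min()) & (X_test.max() <= X_train.max())
print(covered)  # features marked False would require extrapolation at test time
```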
1 vote
1 answer
91 views
Why is the Keras MNIST dataset split into training and test samples of lengths 60k and 10k respectively?
The MNIST dataset can be obtained directly using Keras by running the following lines of Python code. ...
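The elided code is presumably the standard loader; a minimal sketch, assuming tf.keras:

```python
# Minimal sketch of the standard loader: MNIST ships with a fixed 60k/10k split.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
```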
2 votes
1 answer
94 views
LDA perplexity with train-test split leads to absurd results (best model = 1 topic)
I'm working with LDA on a Portuguese news corpus (~800k documents with an average of 28 words each after cleaning the data), and I’m trying to evaluate topic quality using perplexity. When I compute ...
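The question doesn't say which library is used; a minimal sketch of held-out perplexity with scikit-learn's LatentDirichletAllocation and a toy corpus illustrates the setup:

```python
# Sketch of held-out perplexity for LDA: perplexity is evaluated on documents
# that were not used to fit the model.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

docs = [
    "stock market shares rise today", "bank cuts interest rates again",
    "football team wins the final", "striker scores late winning goal",
    "new phone camera impresses reviewers", "laptop battery life disappoints testers",
] * 50  # toy corpus; the real one is ~800k news documents

train_docs, test_docs = train_test_split(docs, test_size=0.2, random_state=0)

vec = CountVectorizer(stop_words="english")
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

for k in (1, 2, 4, 8):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    print(k, lda.perplexity(X_test))  # lower is "better", though often misleading
```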
0 votes
0 answers
38 views
Should I Include Post-Event Data During Training for Time-Series Prediction Models?
I’m working on a time-series prediction problem where my goal is to predict the occurrence of a complication for patients based on sequential data. 🔍 Current Approach: I have sequential data for each ...
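A minimal sketch of one common option, with hypothetical column names, that censors each patient's rows after the event time so training never sees post-event data:

```python
# Sketch (hypothetical column names): keep only rows up to each patient's event
# time, so the model is trained without post-event observations.
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "t":          [0, 1, 2, 0, 1, 2],
    "event_t":    [1, 1, 1, 2, 2, 2],   # time the complication occurred
    "feature":    [0.3, 0.5, 0.9, 0.2, 0.4, 0.6],
})

train_df = df[df["t"] <= df["event_t"]]  # censor post-event rows
print(train_df)
```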
1 vote
0 answers
42 views
Calibrated Classifier on Training Data [closed]
If I am using GridSearchCV to find hyperparameters on a training set, and I then run a CalibratedClassifierCV to tune my probabilities, would it suffice to fit the CalibratedClassifierCV with ...
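A minimal sketch of that workflow, assuming scikit-learn's GridSearchCV and CalibratedClassifierCV, with a LinearSVC chosen purely for illustration:

```python
# Sketch: tune hyperparameters on the training set, then calibrate with internal CV
# on the same training set; the held-out test set stays untouched until the end.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(LinearSVC(), {"C": [0.1, 1, 10]}, cv=5).fit(X_train, y_train)

# Calibrate a fresh copy of the best estimator with 5-fold CV on the training data.
calibrated = CalibratedClassifierCV(grid.best_estimator_, cv=5).fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)
```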
0 votes
0 answers
132 views
How to properly split train/val sets for time series LSTM prediction with multiple unique items?
I am working on a time series prediction problem using an LSTM model. My dataset consists of 27 different items, each with unique IDs, and roughly the same number of samples per item. There are around ...
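A minimal sketch (hypothetical column names) of a per-item chronological split, where the last 20% of each item's history becomes the validation set:

```python
# Sketch: split each item's history chronologically, so validation is the most
# recent 20% of every item's series rather than a random cut across items.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "item_id":   np.repeat(np.arange(27), 100),
    "timestamp": np.tile(np.arange(100), 27),
    "value":     np.random.default_rng(0).normal(size=27 * 100),
}).sort_values(["item_id", "timestamp"])

def split_item(g, val_frac=0.2):
    cut = int(len(g) * (1 - val_frac))
    return g.iloc[:cut], g.iloc[cut:]

parts = [split_item(g) for _, g in df.groupby("item_id")]
train_df = pd.concat(p[0] for p in parts)
val_df = pd.concat(p[1] for p in parts)
print(len(train_df), len(val_df))
```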
1 vote
0 answers
60 views
When to perform node/edge graph feature extraction in graph learning pipeline (PyTorch Geometric)?
I have a CSV file which can be converted into a PyG graph data object for an edge classification task. Before doing that, I thought of adding some features using the NetworkX library. However, since after ...
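A minimal sketch, using only NetworkX and a toy edge list, of computing structural features from the training edges alone so held-out edges don't leak into them:

```python
# Sketch (hypothetical column names): build the graph from *training* edges only,
# then compute structural features, so held-out edges don't leak into the features.
import networkx as nx
import pandas as pd
from sklearn.model_selection import train_test_split

edges = pd.DataFrame({
    "src":   [0, 0, 1, 2, 2, 3, 3, 4],
    "dst":   [1, 2, 2, 3, 4, 4, 0, 1],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})
train_edges, test_edges = train_test_split(edges, test_size=0.25, random_state=0)

G = nx.from_pandas_edgelist(train_edges, source="src", target="dst")
deg = nx.degree_centrality(G)  # node-level feature from training structure only

def edge_features(row):
    return [deg.get(row["src"], 0.0), deg.get(row["dst"], 0.0)]

X_train = train_edges.apply(edge_features, axis=1, result_type="expand")
X_test = test_edges.apply(edge_features, axis=1, result_type="expand")
```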
0 votes
0 answers
50 views
Identify predictors for clustering output?
I have a dataset with variables collected years ago, and many variables collected this year as outcome variables. I want to combine all the variables collected this year to get one outcome, e.g. ...
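A minimal sketch of one way to frame this with scikit-learn: cluster this year's variables into a single label, then test how well the older variables predict it on a held-out split:

```python
# Sketch: derive one outcome by clustering this year's variables, then see how well
# the older variables predict that cluster label on a held-out split.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_old = rng.normal(size=(300, 10))   # variables collected years ago (predictors)
X_new = rng.normal(size=(300, 5))    # variables collected this year (outcomes)

outcome = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_new)

X_train, X_test, y_train, y_test = train_test_split(X_old, outcome, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # how predictable the derived outcome is
```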
2 votes
1 answer
108 views
Model Stacking Train Test Split Methods
I am trying to validate my process for model stacking for binary classification. Say I have two base models, models A and B, both with different classifiers ...
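A minimal sketch of the usual out-of-fold stacking recipe with scikit-learn, using logistic regression and a random forest purely as stand-ins for models A and B:

```python
# Sketch: generate out-of-fold predictions from the base models on the training set,
# train the meta-model on those, and only touch the test set at the very end.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model_a = LogisticRegression(max_iter=1000)
model_b = RandomForestClassifier(random_state=0)

# Out-of-fold probabilities: each training row is predicted by a model that never saw it.
oof_a = cross_val_predict(model_a, X_train, y_train, cv=5, method="predict_proba")[:, 1]
oof_b = cross_val_predict(model_b, X_train, y_train, cv=5, method="predict_proba")[:, 1]
meta = LogisticRegression().fit(np.column_stack([oof_a, oof_b]), y_train)

# Refit the base models on the full training set before predicting the test set.
model_a.fit(X_train, y_train)
model_b.fit(X_train, y_train)
test_meta_X = np.column_stack([
    model_a.predict_proba(X_test)[:, 1],
    model_b.predict_proba(X_test)[:, 1],
])
print(meta.score(test_meta_X, y_test))
```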
2 votes
2 answers
360 views
Purpose of test set in cross-validation
What purpose does the test set serve in k-fold cross-validation? The most common argument in favor of a test set that I can find is to avoid any data leakage between training and testing. But you don'...
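For reference, a minimal sketch of nested cross-validation, the usual alternative when no separate test set is held out:

```python
# Sketch: nested CV; the outer loop scores a model whose hyperparameters were
# tuned only on the inner folds, so no separate test set is needed for the estimate.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```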
8 votes
1 answer
551 views
Should out-of-sample validation also be out-of-time for time-series?
Introduction: When training a model, a "sample" usually refers to the data used to fit the model, so... Sample: data used for training the model; Out-of-sample: data not used for training the model; Out-...
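A minimal sketch of the distinction on a toy time-indexed frame: a random split is out-of-sample only, while a cutoff-date split is also out-of-time:

```python
# Sketch: the same data split two ways — randomly (out-of-sample only) versus by a
# cutoff date (out-of-sample *and* out-of-time).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"date": pd.date_range("2020-01-01", periods=365), "y": range(365)})

random_train, random_test = train_test_split(df, test_size=0.2, random_state=0)

cutoff = df["date"].iloc[int(len(df) * 0.8)]
oot_train, oot_test = df[df["date"] <= cutoff], df[df["date"] > cutoff]
```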
0 votes
0 answers
45 views
Splitting training and test set on a time series problem
I have an OHLCV* dataset that starts on 01-01-2000 and ends on 31-12-2003 and I want to evaluate a model, say an SVM regressor. In other words, given some daily features describing the dynamics of the ...
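A minimal sketch with scikit-learn's TimeSeriesSplit and synthetic features standing in for the OHLCV-derived ones:

```python
# Sketch: walk-forward evaluation with TimeSeriesSplit, so every validation fold
# comes strictly after the data used to fit the SVM regressor.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # daily features (e.g., derived from OHLCV)
y = X[:, 0] * 0.5 + rng.normal(size=1000)

for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(X)):
    model = SVR().fit(X[train_idx], y[train_idx])
    print(fold, model.score(X[test_idx], y[test_idx]))
```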
1 vote
0 answers
75 views
What are the appropriate data splitting techniques for time-dependent sequential datasets, such as breakdown records over time?
I am working with a time-dependent sequential dataset, specifically a record of machine breakdowns over a period of time. My dataset includes data from the sensors of several machines until they fail ...
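A minimal sketch (hypothetical column names) of a group-wise split with GroupShuffleSplit, keeping each machine's whole history on one side of the split:

```python
# Sketch: split by machine ID with GroupShuffleSplit, so a machine's entire
# failure history ends up entirely in train or entirely in test.
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "machine_id": np.repeat(np.arange(10), 50),
    "sensor":     np.random.default_rng(0).normal(size=500),
    "failed":     np.tile([0] * 49 + [1], 10),
})

gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["machine_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
print(train_df["machine_id"].nunique(), test_df["machine_id"].nunique())
```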
3 votes
2 answers
838 views
Test & train split for very small data
I have just 25 observations, and I'm not sure whether it is possible to split them into test and train sets, for example 15 observations for training and 10 for testing. 15 observations is so small for ...
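A minimal sketch of the usual alternative for tiny samples, leave-one-out cross-validation with scikit-learn (toy data in place of the real 25 observations):

```python
# Sketch: with only 25 observations, leave-one-out (or repeated k-fold) CV uses every
# observation for both training and testing instead of sacrificing 10 rows to a test set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=25)

scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")
print(-scores.mean())  # average absolute error over 25 single-observation test sets
```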
2 votes
1 answer
100 views
What is the performance of a "meta" learner that performs CV internally for model selection?
I am trying to understand the proof that reporting CV performance during model selection as performance estimate is optimistically biased. The steps in the proof are the following: Let $p_i, \pi_i$ ...