Questions tagged [data-preprocessing]
A step of cleaning data in data mining for analysis purposes
527 questions
0 votes
0 answers
32 views
Outlier detection in many short time series
I have a dataset with ~20,000 entries containing mean values for different groups. The groups are defined by 4 categorical columns, and I have the week number, the number of samples per week and the ...
1 vote
0 answers
22 views
How to separate transformation/preprocessing of training and validation datasets in glmnet? [closed]
The cross-validation function cv.glmnet, for regularized regression, does not seem to allow for separate transformation/preprocessing of training and validation ...
3 votes
1 answer
32 views
Why are all my tuned models (DT, GB, SVM) plateauing at ~70% F1 after rigorous data cleaning and feature engineering?
I'm working on a classification problem where the goal is to maximize the F1-score, hopefully above 80%. Despite a very thorough EDA and preprocessing workflow, I've hit a hard performance ceiling ...
0 votes
0 answers
77 views
Fitting mixed effect model to factorial survey data
I am currently conducting an online survey in a factorial setting ("vignette study"). I have 8 vignettes in total, varying in three dimensions (let us call them Dimension A, Dimension B and ...
0 votes
0 answers
44 views
Are there clustering algorithms or preprocessing strategies tailored for zero-inflated and continuous data types?
I am currently working on a project where I need to assign customers across N recipes before A/B testing such that KPIs for each customer are balanced across recipes (to reduce pre-test bias). Dataset ...
4 votes
1 answer
133 views
When and how can unsupervised preprocessing before splitting data lead to overoptimistic model performance?
Conceptually, I understand that models should be built totally blind to the test set in order to most faithfully estimate performance on future data. However, I'm struggling to understand the extent ...
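As a rough illustration of the "blind to the test set" idea in this excerpt, here is a minimal sketch that estimates standardization parameters from the training portion only and then reuses them on the held-out portion. The helper names `fit_standardizer` and `apply_standardizer`, the synthetic data, and the 150/50 split are all assumptions made for the example; they are not taken from the question.

```python
import random
import statistics


def fit_standardizer(values):
    """Estimate mean and standard deviation from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Guard against a constant column to avoid dividing by zero.
    return mean, (std if std > 0 else 1.0)


def apply_standardizer(values, params):
    """Standardize values with previously estimated parameters."""
    mean, std = params
    return [(v - mean) / std for v in values]


# Synthetic one-dimensional feature (hypothetical data for illustration).
rng = random.Random(0)
data = [rng.gauss(10.0, 3.0) for _ in range(200)]

# Split first, then fit the preprocessing step on the training part only.
train, test = data[:150], data[150:]
params = fit_standardizer(train)

train_scaled = apply_standardizer(train, params)
# The test set is transformed with the training parameters, never its own.
test_scaled = apply_standardizer(test, params)
```

Fitting `fit_standardizer` on all 200 values instead would let the held-out values influence the preprocessing, which is the situation the question asks about.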
1 vote
0 answers
36 views
NIR spectra preprocessing - two-point linear baseline correction - OPLS
I am doing an analysis of NIR spectra, from which I am trying to measure a physical property that I mostly expect to be scatter. However, my samples have a complex surface morphology and I need some ...
1 vote
0 answers
72 views
How to validate if a dataset has natural clusters?
I've recently learnt unsupervised learning methods such as KMeans and DBSCAN. While working on this dataset, I applied KMeans clustering but faced the following issues: The Elbow Method showed no ...
3 votes
2 answers
576 views
How can I apply KMeans clustering if all variables are highly uncorrelated
I'm applying K-Means clustering to a dataset of ship voyages. The goal is to group voyages into performance-based clusters like cost-efficient, underperforming, etc. I have 12 features in total: 10 ...
3 votes
1 answer
189 views
Large errors with log-transformed Gaussian process regression?
I am working with some data in which the output target values $(Y)$ are all strictly positive values, essentially in the range of 0.001 to 100. Since these values can inherently never be negative or ...
1 vote
0 answers
111 views
Modifying Gaussian Processes and/or using transformations for dealing with positive-only output values? [closed]
I've been reading into different Gaussian processes recently to better fit some data that I'm working with. My data clearly does not follow a multivariate Gaussian as required for a standard exact ...
0 votes
1 answer
98 views
Outlier Removal from only One Class in a binary classification problem
Can outlier removal be done on only one class in a binary classification problem? When faced with class imbalance, for example, can it be done only on the majority class? If so, is there any paper on this ...
6 votes
0 answers
320 views
Reconstructing count table when only pairwise features are visible
Assume we are only able to observe a two-way table counting the number of observations of a pair of categorical features $x_i,x_j$. $$ \begin{array}{c|ccc} & & x_j & \\ \hline ...
1 vote
0 answers
60 views
Neural networks - irregular time shifts of output compared to inputs in given time series data sets
I have some time series data with multiple features. The output is shifted in time: the times at which I have output values are offset from the corresponding inputs, and irregularly so. I have ...
1 vote
0 answers
41 views
Preprocess two different kinds of datasets for a machine learning problem
I am working on two health-related datasets, using Python. One tabular dataset (called A) contains patient-level information (by id) and a bunch of other features which I have already transformed ...