Questions tagged [data-preprocessing]

Question 1

I have a dataset with ~20,000 entries containing mean values for different groups. The groups are defined by 4 categorical columns, and I have the week number, the number of samples per week and the ...

Question 2

The cross-validation function cv.glmnet, for regularized regression, does not seem to allow for separate transformation/preprocessing of training and validation ...

Question 3

I'm working on a classification problem where the goal is to maximize the F1-score, hopefully above 80%. Despite a very thorough EDA and preprocessing workflow, I've hit a hard performance ceiling ...

Question 4

I am currently conducting an online survey in a factorial setting ("vignette study"). I have 8 vignettes in total, varying in three dimensions (let us call them Dimension A, Dimension B and ...

Question 5

I am currently working on a project where I need to assign customers across N recipes before A/B testing such that KPIs for each customer are balanced across recipes (to reduce pre-test bias). Dataset ...

Question 6

Conceptually, I understand that models should be built totally blind to the test set in order to most faithfully estimate performance on future data. However, I'm struggling to understand the extent ...

Question 7

I am analyzing NIR spectra, from which I am trying to measure a physical property that I mostly expect to be scatter. However, my samples have a complex surface morphology and I need some ...

Question 8

I've recently learnt unsupervised learning methods such as KMeans and DBSCAN. While working on this dataset, I applied KMeans clustering but faced the following issues: The Elbow Method showed no ...

Question 9

I'm applying K-Means clustering to a dataset of ship voyages. The goal is to group voyages into performance-based clusters like cost-efficient, underperforming, etc. I have 12 features in total: 10 ...

Question 10

I am working with some data in which the output target values $(Y)$ are all strictly positive values, essentially in the range of 0.001 to 100. Since these values can inherently never be negative or ...

Question 11

I've been reading into different Gaussian processes recently to better fit some data that I'm working with. My data clearly does not follow a multivariate Gaussian as required for a standard exact ...

Question 12

Can outlier removal be done on only one class in a binary classification problem? When facing class imbalance, for example, can it be done only on the majority class? If so, is there any paper on this ...

Question 13

Assume we are only able to observe a two-way entry table counting the number of observations of a pair of categorical features $x_i,x_j$. $$ \begin{array}{c|ccc} & & x_j & \\ \hline ...

Question 14

I have some time series data with multiple features. The output is shifted: the times at which I have output values are offset from the corresponding inputs, and irregularly so. I have ...

Question 15

I am working on two health-related datasets in Python. One tabular dataset (called A) contains patient-level information (by id) and a bunch of other features which I have already transformed ...

Outlier detection in many short time series

How to separate transformation/preprocessing of training and validation datasets in glmnet? [closed]

Why are all my tuned models (DT, GB, SVM) plateauing at ~70% F1 after rigorous data cleaning and feature engineering?

Fitting mixed effect model to factorial survey data

Are there clustering algorithms or preprocessing strategies tailored for zero-inflated and continuous data types?

When and how can unsupervised preprocessing before splitting data lead to overoptimistic model performance?

NIR spectra preprocessing - two point linear baseline correction -OPLS

How to validate if a dataset has natural clusters?

How can I apply KMeans clustering if all variables are highly uncorrelated

Large errors with log-transformed Gaussian process regression?

Modifying Gaussian Processes and/or using transformations for dealing with positive-only output values? [closed]

Outlier Removal from only One Class in a binary classification problem

Reconstructing count table when only pairwise features are visible

Neural networks - irregular time shifts of output compared to inputs in given time series data sets

Preprocess two different kind of datasets for a machine learning problem
