Questions tagged [preprocessing]
Data preprocessing is a data mining technique that transforms raw data into a more understandable or more useful format.
539 questions
0 votes
1 answer
44 views
Correlated Features In Classification Problem
I'm working on a binary classification problem to identify struggling students at a university. I have some features that are correlated, such as high_school_grade_1 that represents 75% of ...
4 votes
0 answers
29 views
Time-efficient parallelization of masks for pre-processing a dataset
I have a large dataset (~10M points) in Python and I want to filter it using a large number of different custom masks, as part of calculations to create a new but related dataset. Because the dataset ...
7 votes
1 answer
97 views
Effects of resizing training images during preprocessing for a CNN classification model
I'm trying to train a CNN model to identify phytoplankton species from a training set. During preprocessing, the images are resized to 224x224, which seems to be stretching or compressing the object ...
1 vote
1 answer
67 views
Is it valid to filter features using t-tests before train/test split in high-dimensional biological data?
I'm working with high-dimensional biological data (∼41,000 features × 3,979 samples from RNA-seq for 2 conditions). Here’s a simplified version of my preprocessing and filtering pipeline before ...
7 votes
1 answer
96 views
Difference between transform('min') vs min() in pandas
I am currently working on a dataset that has two columns: customerID and date. I want to find the minimum date for each customerID. Initially, I used the following code: ...
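The asker's code is cut off in the excerpt above, so the following is only a minimal sketch of the general distinction, using made-up data. The column names customerID and date come from the excerpt; the values are hypothetical. It shows that a grouped min() returns one row per group, while transform('min') broadcasts each group's minimum back onto every original row.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question excerpt.
df = pd.DataFrame({
    "customerID": ["a", "a", "b", "b", "b"],
    "date": pd.to_datetime(
        ["2024-03-01", "2024-01-15", "2024-02-10", "2024-02-01", "2024-05-05"]
    ),
})

# Aggregation: one value per customerID, indexed by customerID.
per_group = df.groupby("customerID")["date"].min()

# Transformation: same length and index as df, so it can be
# assigned straight back as a new column.
per_row = df.groupby("customerID")["date"].transform("min")
df["first_date"] = per_row
```

Which form is preferable depends on what happens next: the aggregated result suits a summary table, while the transformed result suits row-wise comparisons such as df["date"] == df["first_date"].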
1 vote
0 answers
29 views
How can I efficiently process and load a large Protobuf dataset for machine learning model training?
I am training a model on multiple cache miss examples from various trace simulations. For every trace I have thousands of miss examples stored and I have many traces. I'm storing the examples in ...
0 votes
0 answers
30 views
Converting strings to numbers when there are millions of unique values
I am currently preprocessing a large dataset for ML purposes. I am struggling with encoding strings as numbers. I have a dataset of multiple blockchain transactions and I have addresses of ...
1 vote
1 answer
40 views
How to bin/tokenize the amplitude of a stationary time series?
I want to feed the amplitude of a stationary time series into a transformer. I'm planning to tokenize/bin the amplitude into discrete values, so that the transformer learns from unique integer tokens instead of ...
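The excerpt does not say which binning scheme the asker has in mind, so the following is only one possible sketch: equal-width bins over an assumed amplitude range, with out-of-range values falling into the first or last bin. The helper names make_uniform_edges and tokenize, the range [-1, 1], and the bin count are all hypothetical choices for illustration.

```python
from bisect import bisect_right

def make_uniform_edges(lo, hi, n_bins):
    """Return the n_bins - 1 interior edges splitting [lo, hi] into equal-width bins."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def tokenize(values, edges):
    """Map each amplitude to an integer token in 0 .. len(edges).

    Values below the first edge get token 0 and values at or above the
    last edge get the highest token, so outliers are clipped implicitly.
    """
    return [bisect_right(edges, v) for v in values]

# Assumed amplitude range and vocabulary size, purely for illustration.
edges = make_uniform_edges(-1.0, 1.0, 4)   # interior edges: -0.5, 0.0, 0.5
tokens = tokenize([-2.0, -0.75, -0.1, 0.0, 0.3, 0.9, 5.0], edges)
# tokens == [0, 0, 1, 2, 2, 3, 3]
```

Equal-width bins are easy to invert back to an approximate amplitude, but they can leave some tokens rarely used if the amplitude distribution is peaked; quantile-based edges are a common alternative in that case.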