
I have 3 models, and each model solves tasks (say tasks 1 and 2).

Once these tasks (of the same type) are solved by the models, I collect 3 numerical features (say feature1 to feature3) for each task and each model.

Model A:

    Feature1-task1-modelA = 20
    Feature2-task1-modelA = 40
    Feature3-task1-modelA = 55
    Feature1-task2-modelA = 77
    Feature2-task2-modelA = 30
    Feature3-task2-modelA = 22

Model B:

    Feature1-task1-modelB = 10
    Feature2-task1-modelB = 70
    Feature3-task1-modelB = 33
    Feature1-task2-modelB = 88
    Feature2-task2-modelB = 79
    Feature3-task2-modelB = 97

Model C:

    Feature1-task1-modelC = 45
    Feature2-task1-modelC = 65
    Feature3-task1-modelC = 75
    Feature1-task2-modelC = 30
    Feature2-task2-modelC = 40
    Feature3-task2-modelC = 99

These features will eventually be used in a classification problem to determine which model should be selected for solving these tasks.

I am in the process of feature selection, where I am trying to select only the top features that will be beneficial for the model selection.

My thinking is to rank these features using a Chi-square score and p-value, similar to the following:

    Feature     Chi2 Score   p-value
    feature2    3.89         0.1427
    feature3    2.70         0.2592
    feature1    2.41         0.2992

So here, if I select the top 2 features, I will use only feature2 and feature3 in the classification problem.
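For reference, here is a minimal sketch of how I imagine computing such scores, using scikit-learn's SelectKBest with chi2 on the dummy data above (names and numbers are illustrative only):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Dummy data from above: one row per (task, model) pair,
# columns = feature1, feature2, feature3
X = np.array([[20, 40, 55],   # task 1, model A
              [77, 30, 22],   # task 2, model A
              [10, 70, 33],   # task 1, model B
              [88, 79, 97],   # task 2, model B
              [45, 65, 75],   # task 1, model C
              [30, 40, 99]])  # task 2, model C
y = np.array(["A", "A", "B", "B", "C", "C"])  # model label per row

selector = SelectKBest(score_func=chi2, k=2)  # keep the top 2 features
X_top = selector.fit_transform(X, y)

print(selector.scores_)   # chi-square score per feature
print(selector.pvalues_)  # p-value per feature
```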

My question is: how can I aggregate these feature values from the different tasks and then select the top features?

I could be wrong in my overall approach. How can I do this? Are there any other ideas for selecting top features?

Note: don't bother with the numbers, since all the values used above are dummy ones.

1 Answer

Let's break down the problem:

First, what do we have?

  • Models (which are "classes"): A, B, and C. This is the target variable for the classification problem.
  • Tasks: Task 1 and Task 2. These are essentially different "data points" or "samples" for each model.
  • Features: Feature 1, Feature 2, and Feature 3. These are the measured numerical attributes.

Second, what is the goal?

The goal is to build a classifier that takes the features of a new, unseen task and predicts which model (A, B, or C) would be the best fit.

Third, how can we do it?

There are two options:

Option 1: Statistical Feature Selection

Now, as per the question, your thinking was to select features statistically.

Nothing wrong with that, but you need to choose the statistical methodology based on your data analysis: build a hypothesis, check each feature's correlation with the other features, run the classification, and compare accuracy/precision/recall/F1 scores.

It is an ongoing cycle where you might select different features/algorithms/models/evaluations based on your needs. Considerations such as your tolerance for feature information loss, data type, data size, performance, or any other constraint must be studied within the hypothesis to make a rational decision about which statistical algorithm is best.

For example, the Chi-square test is typically used with categorical data: it measures the association between two categorical variables. While you can adapt Chi-square for numerical data by binning it (i.e., converting it to categorical data), this can be tricky and may lead to a loss of information. Therefore, you might decide to select another algorithm X because of reason Y.
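As a minimal sketch of that binning adaptation (assuming scikit-learn; the bin count of 3 and the uniform strategy are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import chi2

X = np.array([[20, 40, 55], [77, 30, 22], [10, 70, 33],
              [88, 79, 97], [45, 65, 75], [30, 40, 99]], dtype=float)
y = np.array(["A", "A", "B", "B", "C", "C"])

# Bin each numerical feature into 3 ordinal categories;
# this binning step is where information can be lost
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X)

scores, pvalues = chi2(X_binned, y)
print(scores, pvalues)
```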

As for aggregating the data, you need to aggregate per feature, since your ultimate goal is to select the best-fit features for the classification.

Here's how you can structure your data for feature selection:

A. Visualize your overall data.

    Sample   Model (Target)   Feature 1   Feature 2   Feature 3
    Task 1   Model A          20          40          55
    Task 2   Model A          77          30          22
    Task 1   Model B          10          70          33
    Task 2   Model B          88          79          97
    Task 1   Model C          45          65          75
    Task 2   Model C          30          40          99
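This long format (one row per task/model pair) is exactly what most libraries expect; a sketch of building it with pandas (column names are illustrative):

```python
import pandas as pd

# One row per (task, model) sample, matching the table above
df = pd.DataFrame({
    "sample":   ["Task 1", "Task 2", "Task 1", "Task 2", "Task 1", "Task 2"],
    "model":    ["A", "A", "B", "B", "C", "C"],  # classification target
    "feature1": [20, 77, 10, 88, 45, 30],
    "feature2": [40, 30, 70, 79, 65, 40],
    "feature3": [55, 22, 33, 97, 75, 99],
})
print(df)
```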

B. Aggregate your data based on features.

Assuming you are analyzing Feature1:

    Values for Model A: [20, 77]
    Values for Model B: [10, 88]
    Values for Model C: [45, 30]
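A sketch of the same grouping done programmatically, assuming the pandas table from step A:

```python
import pandas as pd

# Same table as step A
df = pd.DataFrame({
    "model":    ["A", "A", "B", "B", "C", "C"],
    "feature1": [20, 77, 10, 88, 45, 30],
    "feature2": [40, 30, 70, 79, 65, 40],
    "feature3": [55, 22, 33, 97, 75, 99],
})

# Per-model value lists, one feature at a time
for feature in ["feature1", "feature2", "feature3"]:
    print(feature, df.groupby("model")[feature].apply(list).to_dict())
# feature1 {'A': [20, 77], 'B': [10, 88], 'C': [45, 30]} ...
```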

C. Calculate feature scores across all metrics.

Example: giving some rationale X, I will use the ANOVA F-score and p-value instead of Chi-square.

    Feature     F-Score   p-value
    Feature2    5.2       0.045
    Feature3    2.1       0.210
    Feature1    0.8       0.505

In this example, Feature 2 would be selected as the top feature because its p-value and F-score fit the criteria assigned earlier (e.g., p-value below 0.05).
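A minimal sketch of this step with scikit-learn's ANOVA F-test (f_classif); note that the actual scores on the dummy data will not match the illustrative table above:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Rows = (task, model) samples; columns = feature1..feature3
X = np.array([[20, 40, 55], [77, 30, 22], [10, 70, 33],
              [88, 79, 97], [45, 65, 75], [30, 40, 99]])
y = np.array(["A", "A", "B", "B", "C", "C"])

selector = SelectKBest(score_func=f_classif, k=1)  # keep the single best feature
selector.fit(X, y)

print(selector.scores_)        # ANOVA F-score per feature
print(selector.pvalues_)       # p-value per feature
print(selector.get_support())  # mask of the selected feature(s)
```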

Option 2: Feature Selection using Machine Learning Models

In this option, you can use Model-Based Feature Importance or wrapper methods such as Recursive Feature Elimination.

The idea is not far from Option 1: you select a classifier, train it, get importance scores, eliminate the weakest features, keep the best ones, and repeat.

You can use a Random Forest, Gradient Boosting Machine, or Decision Tree (see the sketch below).

Again, your selection here depends solely on your data analysis, accuracy, and any other constraints along the way.
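A minimal sketch of both ideas with scikit-learn (random forest importances plus RFE); with only six dummy samples this is purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X = np.array([[20, 40, 55], [77, 30, 22], [10, 70, 33],
              [88, 79, 97], [45, 65, 75], [30, 40, 99]])
y = np.array(["A", "A", "B", "B", "C", "C"])

# Model-based importance: train a forest and read the importance scores
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)  # one importance value per feature

# Wrapper method: recursively eliminate down to the 2 strongest features
rfe = RFE(estimator=forest, n_features_to_select=2).fit(X, y)
print(rfe.support_)  # boolean mask of the features RFE kept
```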
