Name of algorithm that maps a string column to a float column, based on an aggregation with another float column, similar to TF-IDF

Question

The Question

I'm not very familiar with the names of common algorithms in data science. I feel like this one is commonly used and so should have a name, and I want to refer to it by its proper name so I can document it correctly in a codebase. I've implemented an algorithm that is kind of like TF-IDF (the only similar algorithm I know by name). It runs on a dataset containing a string column and a float column. Here's how it works on an example table:

Input (str)	Output (float)
a	2.0
b	0.0
a	1.0
a	6.0
c	8.0
c	4.0

Step 1

Group by Input and take the mean of Output.

Input (str)	Output Mean (float)
a	3.0
b	0.0
c	6.0

Step 2

Rank the Inputs by their Output Mean in ascending order, so the Input with the smallest mean gets rank 1.

Input (str)	Rank (float)
a	2.0
b	1.0
c	3.0

Step 3

We then map each Input string in the original table to its rank.

Input (float)	Output (float)
2.0	2.0
1.0	0.0
2.0	1.0
2.0	6.0
3.0	8.0
3.0	4.0
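
The three steps above can be sketched in plain Python as follows. This is a minimal, dependency-free illustration and not the original implementation: the variable names are made up for the example, and ties between equal means are broken by sort order, which the steps above do not specify.

```python
from collections import defaultdict
from statistics import mean

# Example table from the question: (Input, Output) pairs.
rows = [("a", 2.0), ("b", 0.0), ("a", 1.0), ("a", 6.0), ("c", 8.0), ("c", 4.0)]

# Step 1: group by Input and take the mean of Output.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)
means = {key: mean(values) for key, values in groups.items()}

# Step 2: rank the Inputs by their mean, ascending, starting at 1.0.
# Assumption: ties are broken by sort order (not specified in the question).
ordered = sorted(means, key=means.get)
ranks = {key: float(position) for position, key in enumerate(ordered, start=1)}

# Step 3: replace each Input string with its rank.
encoded = [(ranks[key], value) for key, value in rows]

for rank, value in encoded:
    print(rank, value)
```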

Follow-up Question

Assuming the answer to the question above does not also cover this: what is this called for an arbitrary aggregation method, for example taking the median or max instead of the mean in the first step?
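
To make the follow-up concrete, here is a hedged sketch in which the aggregation is just a parameter. The function name `rank_encode` and its `agg` argument are hypothetical and chosen for this illustration only; any callable that reduces a list of floats to one number can be passed in.

```python
from collections import defaultdict
from statistics import mean, median

def rank_encode(keys, values, agg=mean):
    """Map each key to the ascending rank of its aggregated value.

    Hypothetical helper for illustration; ties are broken by sort order.
    """
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    # Aggregate each group with the caller-supplied function.
    stats = {key: agg(group) for key, group in groups.items()}
    ordered = sorted(stats, key=stats.get)
    ranks = {key: float(position) for position, key in enumerate(ordered, start=1)}
    return [ranks[key] for key in keys]

keys = ["a", "b", "a", "a", "c", "c"]
values = [2.0, 0.0, 1.0, 6.0, 8.0, 4.0]

by_mean = rank_encode(keys, values)                # mean, as in Step 1
by_median = rank_encode(keys, values, agg=median)  # median instead of mean
by_max = rank_encode(keys, values, agg=max)        # max instead of mean
```

On the example table all three aggregations happen to produce the same ordering (b, then a, then c), so the encoded column is identical in each case; on other data the ranks can differ.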

It's a fairly common workflow, as you say, but I'm not aware of any special name for it.
– Robert Long, Commented Feb 11 at 12:11

rehaqds · Accepted Answer · 2025-02-12 23:12:25Z

This looks like "Target Encoding", a type of encoding that transforms a categorical feature into a numerical feature using the average of the target values for each category.

But beware of overfitting due to data leakage if your "Output" column is the target for an ML/DL model.
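
One common way to reduce this kind of leakage is to encode each row using only target values from other rows, for example with out-of-fold means. The sketch below is an illustration under assumptions, not a reference implementation: the function name `out_of_fold_mean_encode`, the round-robin fold assignment, and the fallback to the training-fold mean for unseen categories are all choices made for this example.

```python
from collections import defaultdict
from statistics import mean

def out_of_fold_mean_encode(keys, targets, n_folds=3):
    """Encode each row with the mean target of its category,
    computed only from rows in the other folds (hypothetical helper)."""
    n = len(keys)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Assumption: rows are assigned to folds round-robin by index.
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]

        groups = defaultdict(list)
        for i in train:
            groups[keys[i]].append(targets[i])
        # Fallback for categories that do not appear in the training folds.
        fallback = mean(targets[i] for i in train)

        for i in held_out:
            seen = groups.get(keys[i])
            encoded[i] = mean(seen) if seen else fallback
    return encoded

keys = ["a", "b", "a", "a", "c", "c"]
targets = [2.0, 0.0, 1.0, 6.0, 8.0, 4.0]
encoded = out_of_fold_mean_encode(keys, targets)
```

Because a row's own target never contributes to its encoded value, the encoded feature cannot simply memorize the target. The rank step from the question could be applied on top of these out-of-fold means in the same way.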

Scikit-learn has one implementation (TargetEncoder), and the category_encoders package has another, for example.
