$\begingroup$

The Question

I'm not very familiar with the names of common algorithms in data science, but I feel this one is used often enough that it should have a name, and I want to refer to it by its proper name when documenting it in a codebase. I've implemented an algorithm that is somewhat like TF-IDF (the only similar algorithm I know by name). It runs on a dataset with a string column and a float column. Here is how it works on an example table:

Input (str) Output (float)
a 2.0
b 0.0
a 1.0
a 6.0
c 8.0
c 4.0

Step 1

Group by Input and take the mean of Output.

Input (str) Output Mean (float)
a 3.0
b 0.0
c 6.0

Step 2

Rank the Inputs by their Output Mean in ascending order, so the lowest mean gets rank 1.

Input (str) Rank (float)
a 2.0
b 1.0
c 3.0

Step 3

Then map each Input string in the original table to its rank.

Input (float) Output (float)
2.0 2.0
1.0 0.0
2.0 1.0
2.0 6.0
3.0 8.0
3.0 4.0
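
For concreteness, the three steps above can be sketched in plain Python. This is only a minimal sketch: the function name rank_encode and its agg parameter are hypothetical names chosen for illustration, and tied categories are simply kept in first-appearance order rather than given an averaged rank.

```python
from collections import defaultdict
from statistics import mean


def rank_encode(rows, agg=mean):
    """Replace each category with the rank of its aggregated output value.

    `rows` is a list of (category, value) pairs. `agg` is any function that
    reduces a list of floats to one number (hypothetical parameter).
    """
    # Step 1: group the outputs by input category and aggregate each group.
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    aggregated = {category: agg(values) for category, values in groups.items()}

    # Step 2: rank categories by aggregated value, ascending (1.0 = smallest).
    # Assumption: ties are not averaged; sorted() keeps first-appearance order.
    ordered = sorted(aggregated, key=aggregated.get)
    ranks = {category: float(pos) for pos, category in enumerate(ordered, start=1)}

    # Step 3: map each input string in the original rows to its rank.
    encoded = [(ranks[category], value) for category, value in rows]
    return encoded, ranks


rows = [("a", 2.0), ("b", 0.0), ("a", 1.0), ("a", 6.0), ("c", 8.0), ("c", 4.0)]
encoded, ranks = rank_encode(rows)
print(ranks)
print(encoded)
```

On the example table this yields the ranks b = 1.0, a = 2.0, c = 3.0, matching Step 2. Passing a different reducer, for instance agg=max, changes only Step 1.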

Follow-up Question

If the answer to the main question does not already cover this: what is the technique called for an arbitrary aggregation method, for example when we take the median or max instead of the mean in the first step?

$\endgroup$
  • $\begingroup$ It's a fairly common workflow, as you say, but I'm not aware of any special name for it. $\endgroup$ Commented Feb 11 at 12:11

1 Answer

$\begingroup$

This looks like "target encoding", a type of encoding that transforms categorical features into numerical features using the average of the target values for each category.

But beware of overfitting due to data leakage if your "Output" column is the target of an ML/DL model: each row's encoded value is then computed from that row's own target.

Scikit-learn has an implementation (TargetEncoder in sklearn.preprocessing), as does the category_encoders package, for example.
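
To illustrate the leakage warning, here is a minimal sketch of one common mitigation, out-of-fold mean encoding, where each row is encoded using only the targets of rows in other folds. This is an assumption about how one might address the problem, not something taken from the libraries above: the function name out_of_fold_mean_encode, the n_folds parameter, and the simple index-modulo fold assignment are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def out_of_fold_mean_encode(categories, targets, n_folds=3):
    """Encode each row with its category's target mean from the OTHER folds."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Rows with index % n_folds == fold are held out; the rest are "training".
        sums = defaultdict(float)
        counts = defaultdict(int)
        train_total, train_n = 0.0, 0
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                train_total += targets[i]
                train_n += 1
        # Assumed fallback: mean of the training rows for categories unseen there.
        fallback = train_total / train_n
        for i in range(fold, n, n_folds):
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if counts[c] else fallback
    return encoded


categories = ["a", "b", "a", "a", "c", "c"]
targets = [2.0, 0.0, 1.0, 6.0, 8.0, 4.0]
oof = out_of_fold_mean_encode(categories, targets, n_folds=3)
print(oof)
```

Because no row contributes to its own encoding, the encoded feature cannot simply memorize the target. The cost is that the same category can receive different values in different rows.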

$\endgroup$
