You could think of this as a graph, which might be kind of fun.
Let's say you have 100,000 products in your entire corpus. Each could be a node in a fully connected weighted graph, where the weight is the frequency with which two products are ordered together.
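Here's a minimal sketch of building that graph, assuming `orders` is a list of carts, each a list of product IDs (the data and the use of `networkx` here are just illustrative):

```python
from collections import Counter
from itertools import combinations

import networkx as nx

# Hypothetical order data: each cart is a list of product IDs.
orders = [
    ["sneaker_a", "sock_b", "insole_c"],
    ["sneaker_a", "insole_c"],
    ["sock_b", "shirt_d"],
]

# Count how often each pair of products appears in the same order.
pair_counts = Counter()
for cart in orders:
    for a, b in combinations(sorted(set(cart)), 2):
        pair_counts[(a, b)] += 1

# One node per product, edge weight = co-order frequency.
G = nx.Graph()
for (a, b), count in pair_counts.items():
    G.add_edge(a, b, weight=count)
```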
This would be a pretty large graph, but you could merge nodes based on their weighted proximity to bin products into reasonable groups, perhaps dropping the number of features in your space from 100,000 to 100. For our purposes, though, let's say you collapse all 100,000 products into 10 groups.
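One way to do that merging, sticking with the graph framing, is community detection. A hedged sketch continuing from the graph `G` above (the number of communities you get is data-dependent rather than exactly 10):

```python
from networkx.algorithms.community import greedy_modularity_communities

# Partition products into communities of frequently co-ordered items.
groups = greedy_modularity_communities(G, weight="weight")

# Map each product ID to the index of the group it landed in.
product_to_group = {
    product: group_idx
    for group_idx, members in enumerate(groups)
    for product in members
}
```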
You could then encode a cart as the number of occurrences it has in each group. For instance, shoes might chiefly fall into the first 3 of the 10 groups, so a shoe-heavy cart might look something like:
[2,1,1,0,0,0,0,0,0,0]
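A small sketch of that encoding, assuming the `product_to_group` mapping from above and 10 groups in total:

```python
import numpy as np

NUM_GROUPS = 10  # assumed group count, for illustration

def encode_cart(cart, product_to_group, num_groups=NUM_GROUPS):
    """Return a vector of per-group occurrence counts for one cart."""
    vector = np.zeros(num_groups, dtype=int)
    for product in cart:
        vector[product_to_group[product]] += 1
    return vector

# A shoe-heavy cart whose items land in the first three groups might encode as
# array([2, 1, 1, 0, 0, 0, 0, 0, 0, 0]).
```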
You could then do whatever you want with this data. With a lower number of dimensions a tree might be appropriate; with a larger number of dimensions some deep learning approach might be appropriate. Sky's the limit. You could also experiment with different relationships to build the graph, like how often two things are ordered by the same person.
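For example, here's a rough sketch of feeding the encoded carts to a tree-based model; the `labels` target (say, whether the cart converted) is purely hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Encode every cart with the group-count scheme above.
X = np.stack([encode_cart(cart, product_to_group) for cart in orders])
labels = np.array([1, 0, 1])  # hypothetical per-cart target

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X, labels)
```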
I feel like this approach addresses your question because you're encoding relationships in two ways:
- by binning based on proximity
- by encoding relationships spatially within the input vector