How to balance a data set with a different number of observations per id

Question

I am pretty new to the data science field and recently started a personal project on a clustering model. My goal is to create clusters and see what features the customers in each cluster have in common. The data set contains customer information, when and what products they ordered, and more. Below is a sample of the data:

id  gender  age  order_dt   ship_dt    product
1   male    23   1/2/2018   1/9/2018   a
1   male    23   1/5/2018   1/6/2018   b
2   female  45   1/10/2018  1/20/2018  c
3   female  30   1/1/2018   1/2/2018   a
3   female  30   1/15/2018  1/20/2018  c
3   female  30   1/21/2018  1/21/2018  b
3   female  30   1/29/2018  2/1/2018   a

However, each id can contribute a different number of records to the data set. Some have many records because they order often, while others order only once. I have googled unbalanced data, but most results discuss imbalance within one category (a feature, maybe?) rather than an unbalanced number of observations. Should I aggregate on each id so that each id has only one record in the data set, or is there a technique for handling data like this?

Thanks in advance,

mahesh ghanta · Accepted Answer · 2019-06-12 09:38:36Z

How you cluster the data depends on what your objective is.

In this case I am assuming you want to find similarities among customers.

That means the data should be unique at the customer level, i.e. one row per customer. This helps find similarities in customer attributes.

For example: number of transactions, gender, age group, weekend or weekday transactions, etc.

Hence you would first need to aggregate your data at the user level, create features of interest like the examples above, and then try to cluster the customers.
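
A minimal sketch of that aggregation step, using only the sample rows from the question and the Python standard library. The feature names (n_orders, n_products, avg_ship_days, weekend_share) and the month/day/year date format are my assumptions, not something the question specifies; the same grouping could be done with a dataframe library.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows copied from the question:
# (id, gender, age, order_dt, ship_dt, product)
ROWS = [
    (1, "male", 23, "1/2/2018", "1/9/2018", "a"),
    (1, "male", 23, "1/5/2018", "1/6/2018", "b"),
    (2, "female", 45, "1/10/2018", "1/20/2018", "c"),
    (3, "female", 30, "1/1/2018", "1/2/2018", "a"),
    (3, "female", 30, "1/15/2018", "1/20/2018", "c"),
    (3, "female", 30, "1/21/2018", "1/21/2018", "b"),
    (3, "female", 30, "1/29/2018", "2/1/2018", "a"),
]


def parse_date(text):
    # Assumption: dates are written month/day/year.
    return datetime.strptime(text, "%m/%d/%Y")


def aggregate_by_customer(rows):
    """Collapse order-level rows into one feature record per customer id."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row)

    customers = {}
    for cust_id, orders in grouped.items():
        order_dates = [parse_date(o[3]) for o in orders]
        ship_days = [(parse_date(o[4]) - parse_date(o[3])).days for o in orders]
        # weekday() returns 5 for Saturday and 6 for Sunday.
        weekend_orders = sum(d.weekday() >= 5 for d in order_dates)
        customers[cust_id] = {
            "gender": orders[0][1],  # assumed constant per customer
            "age": orders[0][2],     # assumed constant per customer
            "n_orders": len(orders),
            "n_products": len({o[5] for o in orders}),
            "avg_ship_days": sum(ship_days) / len(ship_days),
            "weekend_share": weekend_orders / len(orders),
        }
    return customers


customers = aggregate_by_customer(ROWS)
for cust_id, features in sorted(customers.items()):
    print(cust_id, features)
```

Each id now contributes exactly one record, so a customer with many orders no longer outweighs one with a single order. Before clustering you would still need to encode gender numerically and put the numeric features on a comparable scale.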
