$\begingroup$

I am clustering continuous variables such as AOV, RPV, and conversion rate (conversions/visits). The variables are heavily right-skewed with long tails, and one of them is zero-inflated: more than 50% of its values are zero. Overall, most of the data is concentrated near the origin, and the variables are on different scales. Standard methods such as k-means perform poorly, since the clusters in this data are clearly not spherical.

I am looking for suggestions on a suitable clustering approach, on data transformations, and on how to handle the zero-inflated variable. The number of clusters is not predefined; it should be determined from the data.

$\endgroup$
  • $\begingroup$ Welcome to CV. Please spell out AOV and RPV. Also, please tell us what you want the clusters to be like. That is, how do you want these skewed variables to be treated? What will you do with the clusters? What do you mean by "optimal"? $\endgroup$ Commented Sep 24 at 16:52
  • $\begingroup$ I edited your question to make it clearer and more grammatical. Please check that I did not change what you intended to ask. $\endgroup$ Commented Sep 24 at 16:56
  • $\begingroup$ This will depend on which characteristics of a cluster are relevant in your specific application. There is software for mixtures of skew-normal and skew-t distributions; however, these may have difficulties with a large percentage of zero values. One consideration is whether a zero on such a variable is a distinctive enough feature of an observation that you may want clustering to separate these observations from the others. Another consideration is whether a transformation would help. $\endgroup$ Commented Sep 24 at 17:24
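
The two considerations in the last comment (treating zeros as a distinct feature, and transforming the skewed values) can be sketched as a preprocessing step. This is only one possible reading, not a recommended recipe: the function name `zero_flag_and_scaled_log`, the variable names, and the toy values are all hypothetical, and the choice of `log1p` followed by standardization is an assumption that may or may not suit the application.

```python
import math
import statistics


def zero_flag_and_scaled_log(values):
    """Return (zero indicators, standardized log1p values) for one variable.

    Hypothetical helper: assumes the variable is non-negative, so that
    log1p is defined and maps 0 to 0.
    """
    if any(v < 0 for v in values):
        raise ValueError("this sketch assumes non-negative values")
    # 1.0 where the raw value is exactly zero, else 0.0
    is_zero = [1.0 if v == 0 else 0.0 for v in values]
    # log1p compresses the long right tail while keeping zeros at zero
    logged = [math.log1p(v) for v in values]
    mean = statistics.fmean(logged)
    sd = statistics.pstdev(logged)
    # Standardize so variables on different scales become comparable
    scaled = [(x - mean) / sd if sd > 0 else 0.0 for x in logged]
    return is_zero, scaled


# Made-up toy values for illustration only (not from the question).
aov = [12.0, 30.0, 8.0, 55.0, 20.0, 15.0, 400.0, 10.0, 1200.0, 42.0]
conversion_rate = [0.0, 0.0, 0.0, 0.0, 0.02, 0.0, 0.15, 0.0, 0.4, 0.03]

_, aov_scaled = zero_flag_and_scaled_log(aov)
conv_is_zero, conv_scaled = zero_flag_and_scaled_log(conversion_rate)

# One row per observation: transformed AOV, transformed rate, zero flag
features = [list(row) for row in zip(aov_scaled, conv_scaled, conv_is_zero)]
```

The resulting `features` rows could then be passed to whichever clustering method fits the application. Whether the zero flag should enter as a feature, or whether the zero and non-zero observations should instead be clustered separately, depends on what the clusters are meant to represent.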
