I am currently working on a project where I need to assign customers across N recipes before A/B testing, such that each KPI is balanced across recipes (to reduce pre-test bias).

The dataset contains 4 KPIs for each unique customer.
All 4 KPIs are highly right-skewed (non-negative continuous variables on different scales):

- Skewness of 'KPI1': 1.76
- Skewness of 'KPI2': 12.67
- Skewness of 'KPI3': 25.17
- Skewness of 'KPI4': 3.98

The zeros are real zeros, not errors:

- 'KPI1': around 2% zeros
- 'KPI3': around 2% zeros
- 'KPI4': >50% zeros

Means of the KPIs:

- KPI1: 0.31970
- KPI2: 2189.389833
- KPI3: 7368.538885
- KPI4: 0.136795
Judging from the skewness, the data contains a substantial number of outliers, and most values are concentrated near the origin (especially for KPI1 and KPI4).
My plan was to cluster customers with similar KPI profiles and then assign customers from each cluster to the n_recipes in round-robin order, so that the KPIs end up balanced between recipes. However, I am unable to get good clusters from k-means.
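For reference, this is roughly what I mean (a minimal sketch on synthetic stand-in data; the choices of `n_clusters=20`, quantile transform, and sorting by KPI2 before the round robin are my assumptions, not settled decisions):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Synthetic stand-in: right-skewed, zero-inflated KPIs on different scales.
df = pd.DataFrame({
    "KPI1": rng.lognormal(-2, 1, 1000),
    "KPI2": rng.lognormal(7, 1.5, 1000),
    "KPI3": rng.lognormal(8, 2, 1000),
    "KPI4": rng.lognormal(-3, 1, 1000) * rng.binomial(1, 0.45, 1000),
})
n_recipes = 3

# Transform to comparable scales before clustering (assumed preprocessing).
X = QuantileTransformer(output_distribution="normal").fit_transform(df)
df["cluster"] = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X)

# Round robin within each cluster; sorting first spreads extremes evenly.
df["recipe"] = -1
for _, idx in df.groupby("cluster").groups.items():
    order = df.loc[idx].sort_values("KPI2").index
    df.loc[order, "recipe"] = np.arange(len(order)) % n_recipes

print(df.groupby("recipe")[["KPI1", "KPI2", "KPI3", "KPI4"]].mean())
```

The printed per-recipe means show how well the assignment balances each KPI.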
I am unsure which preprocessing to use: a log, power, or quantile transform, and, for scaling, standard or robust scaling.
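To compare the candidates, I measured how much each transform reduces skewness on a synthetic zero-inflated column resembling KPI4 (the zero rate and lognormal shape are assumptions; `log1p` is used instead of `log` so zeros stay defined):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(1)
# ~50% zeros plus a heavy right tail, similar in shape to KPI4.
x = (rng.lognormal(0, 2, 5000) * rng.binomial(1, 0.5, 5000)).reshape(-1, 1)

print("raw skew:        ", skew(x.ravel()))
print("log1p skew:      ", skew(np.log1p(x.ravel())))  # log1p handles zeros safely
print("yeo-johnson skew:", skew(
    PowerTransformer(method="yeo-johnson").fit_transform(x).ravel()))
print("quantile skew:   ", skew(
    QuantileTransformer(output_distribution="normal").fit_transform(x).ravel()))
```

Note that no monotone transform removes the spike at zero itself; it only compresses the tail, which is part of why I wonder whether the zeros need separate handling.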
I also considered simple multi-strata stratified sampling with round-robin assignment, but the challenge is choosing an optimal q (number of quantile bins) for each KPI (feature). Keeping a fixed q for all KPIs does not result in good balance. Should I scale all KPIs first and then fix a single q value?
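The stratified variant I tried looks roughly like this (synthetic data again; the per-KPI q values are hypothetical, and `duplicates="drop"` in `pd.qcut` collapses the zero spike into a single bin rather than erroring on repeated quantile edges):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "KPI1": rng.lognormal(-2, 1, 2000),
    "KPI4": rng.lognormal(-3, 1, 2000) * rng.binomial(1, 0.45, 2000),
})
n_recipes = 3
q_per_kpi = {"KPI1": 5, "KPI4": 4}  # hypothetical choices; tuning these is the open question

# Per-KPI quantile bins; duplicate edges from the zero spike are merged.
strata = pd.DataFrame({
    k: pd.qcut(df[k], q=q, labels=False, duplicates="drop")
    for k, q in q_per_kpi.items()
})
# Combined stratum label, then round robin inside each stratum.
df["stratum"] = strata.astype(str).agg("-".join, axis=1)
df["recipe"] = df.groupby("stratum").cumcount() % n_recipes

print(df.groupby("recipe")[["KPI1", "KPI4"]].mean())
```

Within each stratum the recipe counts differ by at most one, so imbalance can only come from the binning itself, which is why the choice of q matters so much.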
TL;DR: Which approach (multi-strata stratified sampling or a clustering technique) and which preprocessing strategies are most effective for datasets containing both zero-inflated and continuous variables? Should I transform the variables, switch to a different clustering method, or combine both approaches to improve balance?