I am currently working on a project where I need to assign customers across N recipes before A/B testing, such that each KPI is balanced across recipes (to reduce pre-test bias).

The dataset contains 4 KPIs for each unique customer.
All 4 KPIs are highly right-skewed (non-negative continuous variables on different scales):

- Skewness of 'KPI1': 1.76
- Skewness of 'KPI2': 12.67
- Skewness of 'KPI3': 25.17
- Skewness of 'KPI4': 3.98

The zeros are real zeros, not errors:

- 'KPI1': around 2% zeros
- 'KPI3': around 2% zeros
- 'KPI4': >50% zeros

Means of the KPIs:

- KPI1: 0.31970
- KPI2: 2189.389833
- KPI3: 7368.538885
- KPI4: 0.136795
Judging from the skewness, the data contains a substantial number of outliers, and most values are concentrated near the origin (especially for KPI1 and KPI4).
My plan was to cluster customers with similar KPI profiles and then assign customers from each cluster to the n_recipes in round-robin order, so that the KPIs end up balanced between recipes. However, I am unable to get good clusters from k-means.
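For reference, this is roughly what I mean (a minimal sketch on synthetic stand-in data; the choices of `n_clusters=20`, quantile transform, and sorting by KPI2 before the round robin are my assumptions, not settled decisions):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Synthetic stand-in: right-skewed, zero-inflated KPIs on different scales.
df = pd.DataFrame({
    "KPI1": rng.lognormal(-2, 1, 1000),
    "KPI2": rng.lognormal(7, 1.5, 1000),
    "KPI3": rng.lognormal(8, 2, 1000),
    "KPI4": rng.lognormal(-3, 1, 1000) * rng.binomial(1, 0.45, 1000),
})
n_recipes = 3

# Transform to comparable scales before clustering (assumed preprocessing).
X = QuantileTransformer(output_distribution="normal").fit_transform(df)
df["cluster"] = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X)

# Round robin within each cluster; sorting first spreads extremes evenly.
df["recipe"] = -1
for _, idx in df.groupby("cluster").groups.items():
    order = df.loc[idx].sort_values("KPI2").index
    df.loc[order, "recipe"] = np.arange(len(order)) % n_recipes

print(df.groupby("recipe")[["KPI1", "KPI2", "KPI3", "KPI4"]].mean())
```

The printed per-recipe means show how well the assignment balances each KPI.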
I am unsure which preprocessing to use: a log, power, or quantile transform, and, for scaling, standard or robust scaling.
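To compare the candidates, I measured how much each transform reduces skewness on a synthetic zero-inflated column resembling KPI4 (the zero rate and lognormal shape are assumptions; `log1p` is used instead of `log` so zeros stay defined):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(1)
# ~50% zeros plus a heavy right tail, similar in shape to KPI4.
x = (rng.lognormal(0, 2, 5000) * rng.binomial(1, 0.5, 5000)).reshape(-1, 1)

print("raw skew:        ", skew(x.ravel()))
print("log1p skew:      ", skew(np.log1p(x.ravel())))  # log1p handles zeros safely
print("yeo-johnson skew:", skew(
    PowerTransformer(method="yeo-johnson").fit_transform(x).ravel()))
print("quantile skew:   ", skew(
    QuantileTransformer(output_distribution="normal").fit_transform(x).ravel()))
```

Note that no monotone transform removes the spike at zero itself; it only compresses the tail, which is part of why I wonder whether the zeros need separate handling.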
I also considered simple multi-strata stratified sampling with round-robin assignment, but the challenge is choosing an optimal q (number of quantile bins) for each KPI (feature). Keeping a fixed q for all KPIs does not result in good balance. Should I scale all KPIs first and then fix a single q value?
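The stratified variant I tried looks roughly like this (synthetic data again; the per-KPI q values are hypothetical, and `duplicates="drop"` in `pd.qcut` collapses the zero spike into a single bin rather than erroring on repeated quantile edges):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "KPI1": rng.lognormal(-2, 1, 2000),
    "KPI4": rng.lognormal(-3, 1, 2000) * rng.binomial(1, 0.45, 2000),
})
n_recipes = 3
q_per_kpi = {"KPI1": 5, "KPI4": 4}  # hypothetical choices; tuning these is the open question

# Per-KPI quantile bins; duplicate edges from the zero spike are merged.
strata = pd.DataFrame({
    k: pd.qcut(df[k], q=q, labels=False, duplicates="drop")
    for k, q in q_per_kpi.items()
})
# Combined stratum label, then round robin inside each stratum.
df["stratum"] = strata.astype(str).agg("-".join, axis=1)
df["recipe"] = df.groupby("stratum").cumcount() % n_recipes

print(df.groupby("recipe")[["KPI1", "KPI4"]].mean())
```

Within each stratum the recipe counts differ by at most one, so imbalance can only come from the binning itself, which is why the choice of q matters so much.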
TL;DR: Which approach (multi-strata stratified sampling or a clustering technique) and which preprocessing strategies are most effective for datasets containing both zero-inflated and continuous variables? Should I transform the variables, switch to a different clustering method, or combine both approaches to improve balance?