How to fix intersection of cluster distributions in R

Question

I need help with a clustering task. I have data on vegetation indices. Here is a simple example in R:

clu=structure(list(ndvi_mr75_60_40 = c(0.97, 0.97, 0.8, 0.87, 0.84, 0.83, 0.87, 0.78, 0.99, 0.87, 0.85, 0.91, 0.89, 0.75, 0.97, 0.82, 0.97, 0.99, 0.93, 0.94), ndre_m75_60p20 = c(0.4, 0.42, 0.52, 0.55, 0.37, 0.32, 0.46, 0.5, 0.35, 0.33, 0.37, 0.47, 0.44, 0.43, 0.38, 0.47, 0.44, 0.53, 0.29, 0.51), ndwi_m75_60p20 = c(0.24, 0.26, 0.35, 0.3, 0.31, 0.27, 0.3, 0.28, 0.09, 0.08, 0.21, 0.27, 0.22, 0.31, 0.12, 0.28, 0.2, 0.27, 0.09, 0.29), arvi_m75_60p20 = c(0.58, 0.58, 0.79, 0.75, 0.47, 0.43, 0.7, 0.68, 0.57, 0.45, 0.52, 0.6, 0.68, 0.65, 0.52, 0.61, 0.62, 0.7, 0.37, 0.72), evi_m75_60p20 = c(0.45, 0.44, 0.6, 0.64, 0.39, 0.33, 0.56, 0.55, 0.41, 0.33, 0.39, 0.48, 0.53, 0.51, 0.4, 0.49, 0.49, 0.59, 0.27, 0.61), evi_mr75_p20 = c(0.38, 0.38, 0.55, 0.4, 0.41, 0.3, 0.36, 0.39, 0.55, 0.51, 0.52, 0.37, 0.55, 0.44, 0.45, 0.39, 0.4, 0.4, 0.54, 0.38), wri_m75_60p20 = c(0.47, 0.51, 0.29, 0.31, 0.8, 0.68, 0.5, 0.41, 0.38, 0.52, 0.45, 0.36, 0.4, 0.39, 0.4, 0.39, 0.34, 0.29, 0.71, 0.31), wri_mr75_45p10 = c(0.55, 0.58, 0.39, 0.33, 0.94, 0.79, 0.65, 0.59, 0.68, 0.91, 0.53, 0.56, 0.57, 0.42, 0.63, 0.48, 0.54, 0.4, 0.81, 0.53), wri_mr75_20 = c(0.74, 0.77, 0.39, 0.32, 0.97, 0.82, 0.77, 0.54, 0.61, 0.98, 0.47, 0.59, 0.52, 0.36, 0.65, 0.38, 0.55, 0.36, 0.92, 0.45), ndvi_s85_50 = c(48.51, 47.65, 45.27, 52.05, 37.47, 26.14, 47.43, 45.54, 57.16, 44.9, 47.7, 46.19, 57.25, 44.47, 60.44, 43.22, 57.02, 64.49, 49.04, 56.35), cluster = c(1L, 1L, 1L, 1L, 3L, 5L, 1L, 1L, 4L, 3L, 1L, 1L, 4L, 3L, 4L, 3L, 4L, 4L, 1L, 4L)), class = "data.frame", row.names = c(NA, -20L))

I used k-means; the `cluster` column is the cluster number assigned to each observation. Next, there is yield data for each cluster. Here is an example:

yield=structure(list(cluster = c(1L, 1L, 1L, 1L, 3L, 5L, 1L, 1L, 4L, 3L, 1L, 1L, 4L, 3L, 4L, 3L, 4L, 4L, 1L, 4L), yield = c(2260L, 2016L, 2777L, 1701L, 2202L, 2260L, 1254L, 2103L, 2942L, 1318L, 1633L, 2190L, 2270L, 2767L, 1463L, 2190L, 1773L, 2280L, 1855L, 1670L)), class = "data.frame", row.names = c(NA, -20L))

With this, I can plot a histogram of yield for each of the six clusters (histogram provided here). As you can see, the yield histograms overlap very strongly between clusters. What I need is a uniform distribution of probabilities for yield.

In other words:

1. We have data on vegetation indices and yields as input. Only the vegetation indices are clustered; the yields are not.

2. After clustering the vegetation indices and obtaining the cluster numbers, we add a column with yield. The rows of vegetation indices and the yield values correspond to each other.

Then we plot the histogram of yield for each cluster, like the one I provided. Each row is the ID of a garden bed (that is, the vegetation index data for that bed and its yield).

Therefore, my main question is: how can I normalize the vegetation index data so that, after clustering, the yield distributions of the clusters do not overlap, i.e. achieve a uniform distribution of probabilities like this (artificially drawn)?

Or what is the best method to choose to achieve the "correct result"?

Any help is appreciated.

dipetkov · Accepted Answer · 2022-04-18 00:10:27Z

By "correct result" you mean "desired result".

You want to cluster data by vegetation indices and you expect that the resulting clusters are well separated in terms of yield. But is this expectation justified?

If you plot a histogram/density of yield (not by cluster), do you see 6 different peaks as in your fake plot? I'd guess no because the plot of yield by cluster clearly shows that yield is evenly distributed between 1500 and 3000 with a small peak at around 2000. I'd also guess that the pink cluster is different because it is very small. [You lose information about sample size when you normalize a histogram into a density.]
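The check suggested above can be sketched in base R. This is only an illustration using the 20-row `yield` sample from the question (the full data set is not shown in the post); the number of histogram bins is an arbitrary choice.

```r
# Cluster labels and yields copied from the question's 20-row example.
yield <- data.frame(
  cluster = c(1L, 1L, 1L, 1L, 3L, 5L, 1L, 1L, 4L, 3L,
              1L, 1L, 4L, 3L, 4L, 3L, 4L, 4L, 1L, 4L),
  yield = c(2260L, 2016L, 2777L, 1701L, 2202L, 2260L, 1254L, 2103L, 2942L, 1318L,
            1633L, 2190L, 2270L, 2767L, 1463L, 2190L, 1773L, 2280L, 1855L, 1670L)
)

# Pooled histogram: does yield on its own show several separate peaks?
hist(yield$yield, breaks = 10,
     main = "Yield, all clusters pooled", xlab = "yield")

# Cluster sizes, which a normalized density plot hides.
cluster_size <- table(yield$cluster)
print(cluster_size)

# Minimum and maximum yield within each cluster, to see how much the ranges overlap.
yield_by_cluster <- tapply(yield$yield, yield$cluster, range)
print(yield_by_cluster)
```

If the pooled histogram has no distinct peaks, no clustering of the indices can be expected to split yield into non-overlapping groups.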

Finally, how did you decide on six clusters? You could experiment with a smaller k but you should still expect to see a lot of overlap in the yield distributions.
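As a rough sketch of experimenting with a smaller k: the code below standardizes a few index columns with `scale()`, runs `kmeans()` for k = 2, 3, 4, and prints the yield range per cluster. It is an assumption-laden illustration: only three of the ten index columns from the question are used (to keep it short), and the seed and `nstart` value are arbitrary.

```r
# Three vegetation index columns and the yields, copied from the question's example.
idx <- data.frame(
  ndvi_mr75_60_40 = c(0.97, 0.97, 0.8, 0.87, 0.84, 0.83, 0.87, 0.78, 0.99, 0.87,
                      0.85, 0.91, 0.89, 0.75, 0.97, 0.82, 0.97, 0.99, 0.93, 0.94),
  ndre_m75_60p20 = c(0.4, 0.42, 0.52, 0.55, 0.37, 0.32, 0.46, 0.5, 0.35, 0.33,
                     0.37, 0.47, 0.44, 0.43, 0.38, 0.47, 0.44, 0.53, 0.29, 0.51),
  ndwi_m75_60p20 = c(0.24, 0.26, 0.35, 0.3, 0.31, 0.27, 0.3, 0.28, 0.09, 0.08,
                     0.21, 0.27, 0.22, 0.31, 0.12, 0.28, 0.2, 0.27, 0.09, 0.29)
)
yield <- c(2260L, 2016L, 2777L, 1701L, 2202L, 2260L, 1254L, 2103L, 2942L, 1318L,
           1633L, 2190L, 2270L, 2767L, 1463L, 2190L, 1773L, 2280L, 1855L, 1670L)

set.seed(1)       # k-means starts from random centers; fix the seed for repeatability
x <- scale(idx)   # center each index to mean 0 and scale to standard deviation 1

ks <- 2:4
fits <- lapply(ks, function(k) kmeans(x, centers = k, nstart = 25))

# For each k, show the minimum and maximum yield inside every cluster.
for (i in seq_along(ks)) {
  cat("k =", ks[i], "\n")
  print(tapply(yield, fits[[i]]$cluster, range))
}
```

Yield is never passed to `kmeans()`, so nothing in this procedure pushes the clusters apart in yield; any separation that appears reflects how strongly the indices happen to track yield.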

Peter Flom · Accepted Answer · 2024-07-09 10:54:38Z

I don't see why you are using cluster analysis (CA). CA is unsupervised learning. Its goal is to find ways that observations "go together" or cluster, with no dependent variable. So, it is not surprising that your clusters don't vary on yield.

If you want to figure out what variables are related to yield, you could use some form of regression, or perhaps trees or forests or neural nets or some other form of supervised learning.
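A minimal sketch of the regression route, assuming a plain linear model on three of the index columns from the question. The choice of columns and of `lm()` is only illustrative, and 20 rows are far too few for reliable conclusions.

```r
# Three vegetation index columns plus yield, copied from the question's example.
dat <- data.frame(
  ndvi_mr75_60_40 = c(0.97, 0.97, 0.8, 0.87, 0.84, 0.83, 0.87, 0.78, 0.99, 0.87,
                      0.85, 0.91, 0.89, 0.75, 0.97, 0.82, 0.97, 0.99, 0.93, 0.94),
  ndre_m75_60p20 = c(0.4, 0.42, 0.52, 0.55, 0.37, 0.32, 0.46, 0.5, 0.35, 0.33,
                     0.37, 0.47, 0.44, 0.43, 0.38, 0.47, 0.44, 0.53, 0.29, 0.51),
  ndwi_m75_60p20 = c(0.24, 0.26, 0.35, 0.3, 0.31, 0.27, 0.3, 0.28, 0.09, 0.08,
                     0.21, 0.27, 0.22, 0.31, 0.12, 0.28, 0.2, 0.27, 0.09, 0.29),
  yield = c(2260L, 2016L, 2777L, 1701L, 2202L, 2260L, 1254L, 2103L, 2942L, 1318L,
            1633L, 2190L, 2270L, 2767L, 1463L, 2190L, 1773L, 2280L, 1855L, 1670L)
)

# Supervised model: yield is the response, the indices are the predictors.
fit <- lm(yield ~ ., data = dat)

# Coefficient table and R-squared show how much of the yield variation
# these indices explain in this sample.
print(summary(fit))
```

Unlike clustering, this uses yield directly as the target, so the fit itself tells you whether the indices carry information about yield.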
