
Directed here from StackOverflow

Let's say I want to assess whether there is a correlation between two fields, one of which I know to have a power-law distribution.

A lot of the information I have read assumes a normal distribution.

So how should I proceed if the distribution is not normal and the correlation is non-linear?

Visually, the correlation appears to be non-linear.

If you'd like to see the data, it's available here: https://drive.google.com/file/d/1_CqquGevCafoCIRYbfk3lU9ZimEitNe1/view?usp=sharing

To collect the data:

c3.runStatsFull <- read.csv("./0962d301-2a24-4cc9-ba3b-90759670979f_complete/RunStats.csv")
worker <- read.csv("./999ba3af-ad49-4f1c-9627-14b1d4e2cce9_complete/RunStats.csv")
c3.runStatsFull <- rbind(c3.runStatsFull, worker)
worker <- read.csv("./6135f1e9-da7c-4180-aa53-3e170d50153d_complete/RunStats.csv")
c3.runStatsFull <- rbind(c3.runStatsFull, worker)
worker <- read.csv("./a3819f79-6ef1-4b4c-9d71-35a2fc380c3b_complete/RunStats.csv")
c3.runStatsFull <- rbind(c3.runStatsFull, worker)
worker <- read.csv("./db76feda-f5f5-4648-897d-de99027d5682_complete/RunStats.csv")
c3.runStatsFull <- rbind(c3.runStatsFull, worker)

The likely source of the power-law distribution is that I am working with scale-free networks, generated in the following way:

randomGraph <- barabasi.game(nodeCount, power = 1.2, m = 1,
                             out.dist = NULL, out.seq = NULL, out.pref = FALSE,
                             zero.appeal = 1, directed = FALSE,
                             algorithm = "psumtree", start.graph = NULL)

There seems to be a correlation, just not a linear one.

I know this variable follows a power-law distribution.
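(For what it's worth, the power-law claim for the degree distribution of the generated graph can be checked directly in igraph, the package that provides barabasi.game above. A minimal sketch; fit_power_law fits the exponent by maximum likelihood:)

library(igraph)                     # barabasi.game() and fit_power_law() come from igraph
deg <- degree(randomGraph)          # degree of each node in the generated graph
fit <- fit_power_law(deg)           # fits P(k) ~ k^(-alpha)
fit$alpha                           # fitted power-law exponent
fit$xmin                            # smallest degree included in the fit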

• Try taking the log of processing time. That will likely be closer to normally distributed. The processing time variable might have a lognormal distribution. (Commented Apr 12, 2020 at 7:18)

2 Answers


Your data. To my eye, an important feature of your scatter plot is that the scatter about (what I suppose to be) the regression line is much greater at the right side of the plot than at the left. (In technical language the residuals show unequal variances.)

There is a clear association between the x and y variables, and an important component of that association is linear. I do not imagine that a simple nonlinear curve (say a parabola or third-degree polynomial) would fit the data a lot better than a straight line.
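(If you want to check that on your own data, one option, sketched below, is to nest a straight-line fit inside a quadratic one and compare. I am assuming the processingTime and closeness columns you use later, with processingTime as the response; swap the roles if your scatter plot has them the other way.)

# Sketch: does a quadratic term improve much on a straight line?
fit.lin  <- lm(processingTime ~ closeness, data = c3.runStatsFull)
fit.quad <- lm(processingTime ~ poly(closeness, 2), data = c3.runStatsFull)
anova(fit.lin, fit.quad)   # small p-value would suggest the quadratic term helps
plot(fit.lin, which = 1)   # residuals vs fitted: look for the fan shape noted above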

My simulated data. Here is an example with data simulated in R, showing an association that is not exclusively linear, even though the (Pearson) correlation $r \approx 0.976$ is very close to $1.$

set.seed(2020)
x = 1:20;  y = x + x^2 + rnorm(20, 0, 5)
cor(x, y)
[1] 0.9758755
plot(x, y, pch = 20)
curve(x + x^2, add = T, col = "blue")
reg.out = lm(y ~ x)
abline(reg.out, col = "green")


Points in this plot follow the curve $y = x + x^2$ (blue), except for a small amount of random normal noise. [The regression line (green) is also shown.]

You may be interested in learning about the Spearman correlation. It is found by taking the Pearson correlation of the ranks of the two variables. Because it measures monotonic association, the Spearman correlation $r_S$ disregards the curvature in the plot. In this example $r_S \approx 0.998 > r.$

cor(x, y, meth = "s")
[1] 0.9984962            # Spearman correlation
cor(rank(x), rank(y))
[1] 0.9984962            # Computation via Pearson correlation of ranks

Addendum following comment: Kendall's $\tau \approx 0.989.$

cor(x, y, meth = "k")
[1] 0.9894737
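(Since the computation of $\tau$ is a bit more intricate than Spearman's, here is a minimal sketch of the concordant/discordant-pairs definition, applied to the simulated data above. It matches cor's result here because x and y have no ties:)

n  <- length(x)
ij <- combn(n, 2)                                   # all n*(n-1)/2 pairs of indices
s  <- sign(x[ij[1, ]] - x[ij[2, ]]) * sign(y[ij[1, ]] - y[ij[2, ]])
sum(s) / choose(n, 2)                               # (concordant - discordant) / total pairs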
• Thank you. This certainly sent me on the right path. I was getting a warning using Spearman, as there were tied ranks. Your answer prompted me to look into Kendall's rank correlation tau, which I believe is giving me the results I need. Details in subsequent comment. (Commented Apr 12, 2020 at 20:05)
• The addendum to my answer shows Kendall's $\tau.$ Its computation is a bit more intricate. (Commented Apr 12, 2020 at 20:20)
• Thanks @BruceET. Indeed, Kendall's tau is the route I eventually followed. I appreciate the help (and the obvious effort you put into giving such a complete answer). (Commented Apr 12, 2020 at 20:32)

Thank you @Noah and @BruceET. I've combined your answers, along with further analysis here.

@BruceET, the bunching of the data was substantial, and @Noah's suggestion of taking the log of processingTime was very helpful in that regard.
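(For completeness, the transform itself is just the log, followed by a quick normality check. This is a sketch of how the log.processingTime column used below could be built; it assumes processingTime is strictly positive, which it must be for the log anyway:)

c3.runStatsFull$log.processingTime <- log(c3.runStatsFull$processingTime)
hist(c3.runStatsFull$log.processingTime)     # roughly bell-shaped if lognormal
qqnorm(c3.runStatsFull$log.processingTime)
qqline(c3.runStatsFull$log.processingTime)   # points near the line => approximately normal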

Spearman was giving me a warning. Probably not significant, but it did make me nervous, and I didn't want to have to justify ignoring it.

cor.test(c3.runStatsFull$log.processingTime, c3.runStatsFull$closeness, method = "spearman")

        Spearman's rank correlation rho

data:  c3.runStatsFull$log.processingTime and c3.runStatsFull$closeness
S = 11385697, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.4534844

Warning message:
In cor.test.default(c3.runStatsFull$log.processingTime, c3.runStatsFull$closeness, :
  Cannot compute exact p-value with ties
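(As an aside, that warning only means an exact p-value cannot be computed when there are ties; the asymptotic approximation can be requested explicitly, which silences it. A sketch:)

cor.test(c3.runStatsFull$log.processingTime, c3.runStatsFull$closeness,
         method = "spearman", exact = FALSE)   # asymptotic p-value only, no ties warning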

So I went with Kendall:

cor.test(c3.runStatsFull$processingTime, c3.runStatsFull$closeness, method = "kendall")

        Kendall's rank correlation tau

data:  c3.runStatsFull$processingTime and c3.runStatsFull$closeness
z = 10.481, p-value < 2.2e-16
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau
0.3146949

(Incidentally, I get the same result whether I use processingTime or log.processingTime. That makes sense: Kendall's tau depends only on the ranks of the data, and the log transform is strictly increasing, so it leaves the ranks unchanged.)

[Plots: Kendall's tau using processingTime; Kendall's tau using log.processingTime.]
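(A quick sketch of that rank-invariance sanity check, again assuming processingTime is strictly positive:)

all(rank(c3.runStatsFull$processingTime) ==
      rank(log(c3.runStatsFull$processingTime)))   # TRUE: log preserves the ranks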

Now all this double-negative stuff regarding the null hypothesis melts my head, but assuming I'm interpreting this correctly…

The p-value (< 2.2e-16) is far below 0.05, so there is strong evidence against the null hypothesis that the variables are uncorrelated (tau = 0). In other words, the data are very consistent with there being a correlation, even if I have not proved it.

Please point out, if I'm saying something stupid. :-)

