0

I am looking for a way to score model fits of x,y data on a scale from 0 to 1, depending on which model the data best fit. If data fit a linear model best, the score would come out close to 1. Log would be close to 0, and quadratic would be intermediate (0.5, for example).

My problem is that the function below does not differentiate between linear and quadratic models - any ideas on how to fix this? I might be overlooking something simple or going about this the wrong way. I am using R. Sample code and function below;

`# Functions sat <- function(x, alpha) (1-alpha)*x + alpha*log(x) lik.sat <- function(alpha, x, y) as.numeric(logLik(lm(y ~ sat(x,alpha)))) est.sat <- function(x, y){ fit <- optim(par=.5, fn=lik.sat, method="Brent", control=list(fnscale=-1), lower=0, upper=1, x=x, y=y) if(fit$convergence==0) return(fit$par) return(NA) } # Simulate data #Linear model x<- abs(rnorm(100)) lin.y <- x + rnorm(100, sd=.25) #quad model quad.y <- x + x^2 + rnorm(100, sd=.25) #log model log.y <- log(x) + rnorm(100, sd=.25) # Fit data lik.sat(0, x, lin.y) lik.sat(1, x, lin.y) lik.sat(0, x, quad.y) lik.sat(1, x, quad.y) lik.sat(0, x, log.y) lik.sat(1, x, log.y) # Estimate transformation est.sat(x, lin.y) ## close to 1 est.sat(x, quad.y) #### DOESN'T WORK! est.sat(x, log.y) ## close to 0` 

1 Answer 1

0

This seems problematic because quadratic has 3 coefficients and the others have 2. Instead of trying to to use a continuous parameter just test each of the 3 models to see which gives the lowest AIC. In each of the three examples below it chooses the correct form.

set.seed(123) x<- abs(rnorm(100)) lin.y <- x + rnorm(100, sd=.25) quad.y <- x + x^2 + rnorm(100, sd=.25) log.y <- log(x) + rnorm(100, sd=.25) fos <- list(log = log.y ~ log(x), lin = log.y ~ x, quad = log.y ~ poly(x, 2)) sapply(fos, function(fo) AIC(lm(fo))) # log has lowest AIC ## log lin quad ## 17.15268 203.86473 136.72457 fos <- list(log = lin.y ~ log(x), lin = lin.y ~ x, quad = lin.y ~ poly(x, 2)) sapply(fos, function(fo) AIC(lm(fo))) # lin has lowest AIC ## log lin quad ## 81.577530 3.408632 5.407171 fos <- list(log = quad.y ~ log(x), lin = quad.y ~ x, quad = quad.y ~ poly(x, 2)) sapply(fos, function(fo) AIC(lm(fo))) # quad has lowest AIC ## log lin quad ## 313.2159 110.9217 1.0147 

Note that this can be made automatic and is superior to using nls since lm is much more stable.

We have returned the name of the chosen model but it could readily be modified to return the formula or the lm object depending on what is wanted. In those cases call it using lapply and Map instead of sapply and mapply.

opt <- function(x, y) { nm <- if (length(unique(x)) < 3) { "lin" } else { fos <- list(log = y ~ log(x), lin = y ~ x, quad = y ~ poly(x, 2)) s <- sapply(fos, function(fo) AIC(lm(fo))) names(s)[which.min(s)] } nm # optionally uncomment 1 of next 2 lines # fos[[nm]] # lm(fos[[nm]]) } sapply(list(A = log.y, B = lin.y, C = quad.y), opt, x = x) ## A B C ## "log" "lin" "quad" 

Another example. This one uses the built-in anscombe data set.

mapply(opt, anscombe[1:4], anscombe[5:8]) ## x1 x2 x3 x4 ## "log" "quad" "lin" "lin" 
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for your answer. I wanted to use a scale function because I am analysing many datasets (thousands) for best fit and wanted a simple way to visualise differences in trends across categories of datasets. AIC definitely works, but I would like to use a sliding 0-1 scale rather than using grouping model data into log, linear, and quadratic categories (if possible).
I have added some code that shows that this can be made completely automatic.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.