I'm performing IDW interpolation using the idw R function from the gstat package. For the purpose of determining what combination of "idp" and "nmax" produces the lowest RMSE in leave-one-out-cross validation, how can I specify the "idp" and the "nmax" that are going to be use in the krige.cv function?
2 Answers
I've had the same problem and just found a solution: In the gstat-documentation on page 21 they use for defining idp the expression set = list(idp = 0.5). For "nmax" it's as usual:
meuse.idw_cv <- krige.cv(log(zinc)~1, meuse, nmax = 7, set = list(idp = .5)) Here is a worked out CV example using the meuse dataset:
# Load Libraries library(data.table) library(ggplot2) library(viridisLite) library(sp) library(gstat) # Settings seed=123L # Convert data to SpatialPointsDataFrame data(meuse) coordinates(meuse)= ~x+y # Log transform meuse$logZinc=log(meuse$zinc) # Save data.frame meuse.df=as.data.table(meuse) # View data spplot(meuse['logZinc'],,auto.key=T,key.space='right',scales=list(draw=T)) # Initialize cv variables nfold=10L pmax=8L nTimes=5L cv=expand.grid(B=1:nTimes,p=0:pmax,RMSE=NA_real_) setDT(cv) setorder(cv,B,p) dim(cv) # 180 xx fit.cv=vector(mode='list',length=nrow(cv)) # B-times k-fold CV i=1L set.seed(seed) for(i in 1:nrow(cv)){ fit.cv[[i]]=krige.cv(logZinc~1,locations=meuse,nfold=nfold,set=list(idp=cv[i,p])) head(fit.cv[[i]]) spplot(fit.cv[[i]]['var1.pred'],auto.key=T,key.space='right',scales=list(draw=T)) cv[i,RMSE:=sqrt(mean(fit.cv[[i]]$residual^2))] } # END Loop # cv # Aggregate across B times cvAvg=cv[,.(RMSE=mean(RMSE)), by=p] # cvAvg # Plot CV results par(mar=c(4.5,4.5,1,1)) plot(RMSE~p,data=cv,type='p') lines(RMSE~p,data=cvAvg,lwd=3,col=4) # Set optimum p which.min(cvAvg$RMSE) # 4 p=3 # Create gridded newdata data(meuse.grid) newdata=meuse.grid coordinates(newdata)= ~x+y # Fit IDW to newdata rm(fit) fit=idw(logZinc~1,locations=meuse,newdata=newdata,idp=p) # fit=krige(logZinc~1,locations=meuse,newdata=newdata,set=list(idp=p)) # Equivalent fit.df=as.data.table(fit) # Plot - Observed ggplot(meuse.df,aes(x,y,fill=zinc))+ geom_point(size=3,shape=24)+ coord_equal()+ scale_fill_viridis_c()+ labs(y='Y',x='X',title='Observed Data',fill='Zinc, mg/L')+ theme_classic()+ theme(legend.position='bottom') # Plot - Fitted ggplot(fit.df,aes(x,y,fill=var1.pred))+ geom_raster()+ geom_point(data=meuse.df,aes(y=coordinates(meuse)[,'y'],x=coordinates(meuse)[,'x']),inherit.aes=F,size=3,shape=24,fill='orange')+ # Observed coord_equal()+ scale_fill_viridis_c()+ labs(y='Y',x='X',title=sprintf('IDW Fitted (p=%d)',p),fill='Zinc, mg/L')+ theme_classic()+ theme(legend.position='bottom') The code applies 10-fold CV (the OP requested LOOCV) to select the ideal power value. The CV is averaged across 5 instances to account for potential re-sampling anomalies due to the random seed. Note that krige.cv creates folds using simple random sampling. A more customized re-sampling by strata approach may be more appropriate for your data.