Variable scope in R

Question

I have a data.frame named factor_nonagg with 50 rows and 3 columns. I wrote a function category() with argument factors. I am making changes to factors in the function. When I pass the data.frame to this function, no changes are made in the data.frame. Can someone help me in making permanent changes to my data.frame?

n=50
category=function(factors){
  for(i in 1:n){
    if(factors[i,1]>=90) factors[i,1]<-2*.45
    else if(factors[i,1]>=65) factors[i,1]<-1*.45
    else factors[i,1]<-0

    if(factors[i,2]>=.190) factors[i,2]<-2*.25
    else if(factors[i,2]>=.140) factors[i,2]<-1*.25
    else factors[i,2]<-0

    if(factors[i,3]>=.03) factors[i,3]<-2*.30
    else if(factors[i,3]>=.015) factors[i,3]<-1*.30
    else factors[i,3]<-0
  }
}
category(factor_nonagg)

There are several problems here, but the most important one for your question is that you need return(factors) at the end of your function. If you then want that value to overwrite your factor_nonagg object, you need to assign it with <-: factor_nonagg <- category(factor_nonagg).
– Thomas, commented Jun 23, 2014 at 13:03

Thomas · Accepted Answer · 2014-06-23 13:32:51Z

R does not easily support pass-by-reference type behavior with functions. When you make a change to a parameter value within a function, a copy of the object is made and the changes last only as long as the function call.

Typically you have your function return the changed value (return(factors)) and assign that new value to the original variable:

factor_nonagg <- category(factor_nonagg)
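The copy behavior described above can be seen in a small sketch. The helper double_first and the toy data frame dat are hypothetical names made up for illustration; they are not from the question.

```r
# Hypothetical helper: doubles the first column of a data frame.
double_first <- function(d) {
  d[[1]] <- d[[1]] * 2   # changes the function's local copy only
  d                      # the modified copy is the function's value
}

dat <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Calling without assignment: the result is computed and then discarded.
invisible(double_first(dat))
unchanged <- identical(dat$a, c(1, 2, 3))   # the caller's dat is untouched

# Assigning the result back is what makes the change stick.
dat <- double_first(dat)
changed <- identical(dat$a, c(2, 4, 6))
```

Both unchanged and changed should come out TRUE, which is the same pattern as factor_nonagg <- category(factor_nonagg).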

IRTFM · Accepted Answer · 2014-06-23 22:29:57Z

Looping through rows of dataframes is going to be painfully slow. This is a vectorized approach that is admittedly untested in the absence of data but does not throw an error with the other test data offered by dardisco:

category=function(factors){
  factors[[1]] <- 0.45*(0:2)[ findInterval(factors[[1]], c(-Inf, 65, 90, Inf) )]
  factors[[2]] <- 0.25*(0:2)[ findInterval(factors[[2]], c(-Inf, 0.140, 0.190, Inf) )]
  factors[[3]] <- 0.30*(0:2)[ findInterval(factors[[3]], c(-Inf, 0.015, 0.03, Inf) )]
  return(factors)
}

And, of course, as with all functional languages, factor_agg would not be modified except with an assignment command:

category(factor_agg)                # no effect
factor_agg <- category(factor_agg)  # replacement occurs.

findInterval is a very useful vector-oriented function that can either be used to return a grouping value or used, as in this example, as an index to select from a set of either character or numeric values.
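As a rough illustration of both uses, here is a sketch on a made-up vector x with the first column's break points. The label names are hypothetical and only show that the same index works for character values.

```r
# Made-up input values; the break points are the ones used for column 1.
x <- c(50, 65, 80, 90, 120)

# Grouping value: intervals are closed on the left by default,
# so 65 falls in group 2 and 90 falls in group 3.
bins <- findInterval(x, c(-Inf, 65, 90, Inf))

# Used as an index into numeric values ...
scores <- 0.45 * (0:2)[bins]

# ... or into character values (hypothetical labels).
labels <- c("low", "mid", "high")[bins]
```

Because the intervals are left-closed, this matches the >= comparisons in the original loop.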

abel · Accepted Answer · 2014-06-23 13:05:20Z

You need to set an output object in your function that returns the changes you make to your df. This is achieved by adding

return(factors)

just before your last curly bracket in your function definition.
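Applied to the function from the question, that might look like the sketch below. Two things here are assumptions rather than part of the original: the if/else chains are condensed to one assignment per column, and the loop uses seq_len(nrow(factors)) instead of the global n. The three-row factor_nonagg is toy data for illustration only.

```r
category <- function(factors) {
  # Assumption: loop over the actual row count rather than a global n.
  for (i in seq_len(nrow(factors))) {
    factors[i, 1] <- if (factors[i, 1] >= 90) 2 * .45 else if (factors[i, 1] >= 65) 1 * .45 else 0
    factors[i, 2] <- if (factors[i, 2] >= .190) 2 * .25 else if (factors[i, 2] >= .140) 1 * .25 else 0
    factors[i, 3] <- if (factors[i, 3] >= .03) 2 * .30 else if (factors[i, 3] >= .015) 1 * .30 else 0
  }
  return(factors)   # hand the modified copy back to the caller
}

# Toy data, not the asker's real data frame.
factor_nonagg <- data.frame(
  f1 = c(95, 70, 10),
  f2 = c(0.20, 0.15, 0.10),
  f3 = c(0.05, 0.02, 0.01)
)

# The assignment is what overwrites the original object.
factor_nonagg <- category(factor_nonagg)
```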

5 Comments

Konrad Rudolph Over a year ago

There is no need to use return here (and indeed it is not in the spirit of R).

abel Over a year ago

@Konrad True, functions do not necessarily need an explicit return statement, because the last evaluated expression is taken as the output otherwise. However, the for loop in this code returns NULL, and therefore the whole function returns NULL if the output is not specified.

Konrad Rudolph Over a year ago

Just use factors then – no need for return(factors). Better yet, of course, would be to refactor the function so that no for is used.

Dason Over a year ago

@KonradRudolph I wouldn't go saying that it "isn't in the spirit of R" to use an explicit return statement. It has been shown to be slightly faster to omit it, but I prefer seeing the return statement. Code clarity is more important, in my opinion, than an unnoticeable speed boost. And while most R users will understand what is going on, it makes it easier for others to read the code as well.

Konrad Rudolph Over a year ago

@Dason It’s got nothing to do with performance. It’s simply misinterpreting what functions and values are in functional programming (and R is a functional language), and thus on par with writing if (x == TRUE) instead of if (x).
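The point debated in the comments above can be checked with a small sketch. The three function names are hypothetical; each doubles a vector in a for loop and differs only in how, or whether, it hands the result back.

```r
# Ends with the for loop: the value of the function is NULL.
no_value <- function(x) {
  for (i in seq_along(x)) x[i] <- x[i] * 2
}

# Ends with the bare object: the last evaluated expression is returned.
implicit <- function(x) {
  for (i in seq_along(x)) x[i] <- x[i] * 2
  x
}

# Ends with an explicit return(): same result as the implicit version.
explicit <- function(x) {
  for (i in seq_along(x)) x[i] <- x[i] * 2
  return(x)
}

v <- c(1, 2, 3)
loop_is_null <- is.null(no_value(v))
same_result  <- identical(implicit(v), explicit(v))
```

Both loop_is_null and same_result should be TRUE, so the choice between a trailing x and return(x) is one of style, while omitting both loses the result.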

dardisco · Accepted Answer · 2014-06-23 17:43:52Z

You could approach it like this:

set.seed(1)
df1 <- data.frame(
  f1 = sample(seq(150), size=50, replace=TRUE),
  f2 = sample(seq(250) / 1000, size=50, replace=TRUE),
  f3 = sample(seq(50) / 1000, size=50, replace=TRUE)
)

### vals1 = values (the two thresholds for a column)
### mult1 = multiplier
fun1 <- function(x, vals1, mult1){
  if (x >= max(vals1)) return(mult1*2)
  if (x >= min(vals1) && x < max(vals1)) return(mult1)
  return(0)
}

### within() evaluates a single expression, so the assignments go inside { }
df1 <- within(df1, {
  f1 <- sapply(f1, fun1, vals1=c(90, 65), mult1=0.45)
  f2 <- sapply(f2, fun1, vals1=c(0.19, 0.14), mult1=0.25)
  f3 <- sapply(f3, fun1, vals1=c(0.03, 0.015), mult1=0.3)
})

This avoids the for loop (although short loops are not necessarily a bad thing), saves on typing, and makes it easier to generalize if you want to change the values or multiplier. I'm using return in fun1 as it has multiple exit points.
