
Consider, for example, the "iris" data frame, which ships with the base installation of R:

names(iris)
# [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
levels(iris$Species)
# [1] "setosa"     "versicolor" "virginica"

Now I construct three models without attaching "iris":

t1 = lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length, data = iris)
t2 = lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length, data = iris[iris$Species == "setosa", ])
t3 = lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length, data = iris, subset = (iris$Species == "setosa"))

Now I think t2 = t3 <> t1, but R says t1 = t2 <> t3. Why am I wrong?

Now I construct my models again, but this time after attaching "iris":

attach(iris)
t1 = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
t2 = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris[iris$Species == "setosa", ])
t3 = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris, subset = (iris$Species == "setosa"))

Now R and I both think t2 = t3 <> t1. But again I'm confused by the effect of attaching on the models! I expected the first set of models to be equivalent to the second set, but R says no. Thanks.

  • First note: you don't need to put iris$ before each variable name when a function uses data=iris. E.g., tm1=lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris) works. Commented Jun 25, 2016 at 19:52
  • Second, I think it is better to create subsets outside of the model statements, e.g., setosa <- subset(iris, subset = Species == "setosa"). Then you can run the linear model on the subset. Commented Jun 25, 2016 at 19:54
  • So does putting iris$ before each variable have some effect on the models? @Twitch_City Commented Jun 25, 2016 at 20:03
  • Also, my other recommendation is to never use attach(); it just leads to bad habits. Commented Jun 25, 2016 at 20:05
  • Oh, I get your point. Yes, in the first block, the iris$ is what is actually changing the model for t2. Basically, you are trying to give a subset to the data argument, then overriding that when you use iris$. Just remove the iris$ from t2 in the first block and it will match the other subsetted output. Commented Jun 25, 2016 at 20:12

2 Answers

It's a scoping issue. If you do:

t1 = lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length, data = iris)
t2 = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris[iris$Species == "setosa", ])
t3 = lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length, data = iris, subset = (iris$Species == "setosa"))

You get the desired result.

coef(t1) == coef(t2)
      (Intercept)  iris$Sepal.Width iris$Petal.Length
            FALSE             FALSE             FALSE
coef(t2) == coef(t3)
 (Intercept)  Sepal.Width Petal.Length
        TRUE         TRUE         TRUE

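When you say iris$Sepal.Length, R already knows where to look for that value: it takes the full-length column straight from the iris data frame and never looks it up in whatever was passed to data. The data argument is thus redundant and R effectively ignores it, which is why subsetting the data frame in the original t2 had no effect. The subset argument, by contrast, is still applied to those vectors, which is why t3 matches the subsetted fit. As mentioned in the comments, there is no need to use foo$bar when data = foo is supplied, and this situation looks to be a good example of why not to do so.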
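A quick way to see the scoping effect is to compare how many observations each fit actually used. This is only a minimal sketch: the object names full and sub are made up for illustration, and nobs() is used here simply as a convenient row count for an lm fit.

```r
# Same subsetted data frame passed to data= in both calls.

# With iris$ in the formula, the columns come from the full iris data frame,
# so the 50-row data frame passed to data= is never consulted.
full <- lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length,
           data = iris[iris$Species == "setosa", ])

# With bare column names, the variables are looked up in data= first,
# so only the setosa rows enter the fit.
sub <- lm(Sepal.Length ~ Sepal.Width + Petal.Length,
          data = iris[iris$Species == "setosa", ])

nobs(full)  # 150
nobs(sub)   # 50
```

If the two counts differ like this, the data argument was bypassed in the first call.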

Two methods for fitting a linear model on a subset:

Creating the subset manually

setosa <- subset(iris, subset = Species == "setosa")
t1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = setosa)

Using the subset argument in lm()

t2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris, subset = Species == "setosa") 

t1 and t2 are equivalent. However, if you use iris$ in the lm() call, R ignores what is passed to data, since you are explicitly giving the full-length vectors to the function rather than letting it look the columns up in the data frame. (The subset argument is still applied to those vectors, as t3 in the question shows.) This is an incorrect way to use lm().
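The equivalence of the two methods can be checked directly. The sketch below refits both models under new, purely illustrative names (fit_manual, fit_subset) so that it runs on its own, and compares the coefficients with all.equal() rather than ==, to allow for floating-point tolerance.

```r
# Method 1: subset the data frame first, then fit
setosa <- subset(iris, subset = Species == "setosa")
fit_manual <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = setosa)

# Method 2: let lm() apply the row filter via its subset argument
fit_subset <- lm(Sepal.Length ~ Sepal.Width + Petal.Length,
                 data = iris, subset = Species == "setosa")

# Both fits use the same 50 setosa rows and give the same coefficients
all.equal(coef(fit_manual), coef(fit_subset))  # TRUE
nobs(fit_manual)                               # 50
nobs(fit_subset)                               # 50
```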
