Constructing a model on a subset of a data frame

Question

Consider, for example, the "iris" data frame, which ships with the standard R installation:

    names(iris)
    # [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
    levels(iris$Species)
    # [1] "setosa" "versicolor" "virginica"

Now I construct three models without attaching "iris":

    t1=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris)
    t2=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris[iris$Species=="setosa",])
    t3=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris , subset = (iris$Species=="setosa"))

I expected t2 = t3 ≠ t1, but R gives t1 = t2 ≠ t3. Why am I wrong?

Now I construct the models again, this time after attaching "iris":

    attach(iris)
    t1=lm(Sepal.Length ~ Sepal.Width + Petal.Length , data=iris)
    t2=lm(Sepal.Length ~ Sepal.Width + Petal.Length , data=iris[iris$Species=="setosa",])
    t3=lm(Sepal.Length ~ Sepal.Width + Petal.Length , data=iris , subset = (iris$Species=="setosa"))

Now R and I agree: t2 = t3 ≠ t1. But I am confused again, this time by the effect of attaching on the models. I thought the first set of models was equivalent to the second set, but R says no. Thanks.

First note: you don't need to put iris$ before each variable name when a function takes data=iris. For example, tm1=lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris) works.
– Twitch_City, Commented Jun 25, 2016 at 19:52
Second, I think it is better to create subsets outside of the model statements, e.g., setosa <- subset(iris, subset = Species == "setosa"). Then you can run the linear model on the subset.
– Twitch_City, Commented Jun 25, 2016 at 19:54
So does putting iris$ before each variable have some effect on the models? @Twitch_City
– moshtaba, Commented Jun 25, 2016 at 20:03
Also, my other recommendation is to never use attach(); it just leads to bad habits.
– Twitch_City, Commented Jun 25, 2016 at 20:05
Oh, I get your point. Yes, in the first block, the iris$ is what actually changes the model for t2. Basically, you give a subset to the data argument, then override it by using iris$. Just remove the iris$ from t2 in the first block and it will match the other subsetted output.
– Twitch_City, Commented Jun 25, 2016 at 20:12

Bryan Goggin · Accepted Answer · 2016-06-25 20:25:23Z

It's a scoping issue. If you do:

    t1=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris)
    t2=lm(Sepal.Length ~ Sepal.Width + Petal.Length , data=iris[iris$Species=="setosa",])
    t3=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris , subset = (iris$Species=="setosa"))

you get the desired result:

    coef(t1) == coef(t2)
          (Intercept)  iris$Sepal.Width iris$Petal.Length
                FALSE             FALSE             FALSE
    coef(t2) == coef(t3)
     (Intercept)  Sepal.Width Petal.Length
            TRUE         TRUE         TRUE

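When you say iris$Sepal.Length, R already knows where to look for that value: it takes the full 150-row column straight from iris. The data argument is thus redundant for those terms and R never consults it, which is why the original t2 matched t1. The subset argument, by contrast, is still applied to the full-length vectors, which is why t3 is the setosa-only fit. As mentioned in the comments, there is no need to use foo$bar when data = foo is supplied, and this situation looks to be a good example of why not to do so.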
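A minimal sketch for checking which rows each call actually uses, relying only on base R and the built-in iris data. The object names (full, wrong, by_subset, right) are illustrative and do not come from the question. Counting observations with nobs() makes the difference visible without comparing coefficients:

```r
# iris$ in the formula: each term is a full-length vector taken directly from iris
full      <- lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length,
                data = iris)
wrong     <- lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length,
                data = iris[iris$Species == "setosa", ])
by_subset <- lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length,
                data = iris, subset = (iris$Species == "setosa"))

# Bare column names: looked up in the data argument first
right     <- lm(Sepal.Length ~ Sepal.Width + Petal.Length,
                data = iris[iris$Species == "setosa", ])

nobs(full)       # 150
nobs(wrong)      # 150: the 50-row data argument was never consulted
nobs(by_subset)  # 50: subset is applied to the full-length vectors
nobs(right)      # 50
```

Checking nobs() (or nrow(model.frame(fit))) after fitting is a cheap way to confirm that a model was built on the rows you intended.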

Twitch_City · Accepted Answer · 2016-06-25 20:16:18Z

Two methods for conducting a linear model on a subset:

Creating the subset manually

    setosa <- subset(iris, subset = Species == "setosa")
    t1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=setosa)

Using the subset argument in `lm()`

t2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris, subset = Species == "setosa")

t1 and t2 are equivalent. However, if you use iris$ in the lm() call, R ignores what is passed to data for those terms, since you are explicitly giving full-length vectors to the function rather than column names to be looked up in the data frame. (A subset argument is still applied, but to those full-length vectors.) This is an incorrect way to use lm().
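The equivalence of the two methods can be checked directly. The sketch below repeats the two calls from this answer and compares the fits; it assumes only base R and the built-in iris data:

```r
# Method 1: subset the data frame first, then fit
setosa <- subset(iris, subset = Species == "setosa")
t1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = setosa)

# Method 2: pass the full data frame and let lm() do the subsetting
t2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris,
         subset = Species == "setosa")

all.equal(coef(t1), coef(t2))  # TRUE
nobs(t1)                       # 50
nobs(t2)                       # 50
```

Because the formula uses bare column names, Species in the subset expression is also resolved inside the data frame, so neither iris$ nor attach() is needed.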
