I have been using the %in% operator ever since I first learned about it.

However, I still don't fully understand how it works. I thought I did, but I am never sure about the order of the operands.

Here is an example:

This is my dataframe:

df <- data.frame("col1"=c(1,2,3,4,30,21,320,123,4351,1234,3,0,43), "col2"=rep("something",13)) 

This is how it looks:

> df
   col1      col2
1     1 something
2     2 something
3     3 something
4     4 something
5    30 something
6    21 something
7   320 something
8   123 something
9  4351 something
10 1234 something
11    3 something
12    0 something
13   43 something

Let's say I have a numerical vector:

myvector <- c(30,43,12,333334,14,4351,0,5,55,66) 

And I want to check if all the numbers (or some) from my vector are in the previous dataframe. To do that, I always use %in%.

I thought of two approaches:

# common in both: 30, 4351, 0, 43

# are the numbers from df$col1 in my vector?
trial1 <- subset(df, df$col1 %in% myvector)

# are the numbers of the vector in df$col1?
trial2 <- subset(df, myvector %in% df$col1)

Both approaches make sense to me, and I expected them to give the same result. However, only the result from trial1 is correct.

> trial1
   col1      col2
5    30 something
9  4351 something
12    0 something
13   43 something

What I don't understand is why the second way is giving me some common numbers and some which are not in the vector.

> trial2
   col1      col2
1     1 something
2     2 something
6    21 something
7   320 something
11    3 something
12    0 something

Could someone explain to me how the %in% operator works and why the second way gives me the wrong result?

Thanks very much in advance

Regards

  • %in% returns a logical vector indicating if there is a match or not for its left operand. Commented Nov 23, 2021 at 12:40
  • The first approach is the correct one: %in% creates a logical vector of the same length as its left operand, and the data is then subsetted by that vector. The second approach gives a nonsense subset because the lengths do not match and the logical vector is recycled. Commented Nov 23, 2021 at 12:40
  • Trial 2 is wrong since you are subsetting df based on the positions of the vector components (and from the documentation 'missing values are taken as false'). Commented Nov 23, 2021 at 12:42
  • The key is recycling of the different-length output, as displayed in Merijn's answer. One should always be careful to align the length of output to the number of rows in a frame; myvector %in% df$col1 will always return a vector the same length as length(myvector) regardless of nrow(df), which means that that return value is not safe for subsetting df. Commented Nov 23, 2021 at 14:15
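
The length point made in these comments can be checked directly. This is a minimal sketch using the `df` and `myvector` from the question; the names `by_row`, `by_vector` and `safe`, and the `stopifnot()` guard, are illustrative additions rather than something the commenters wrote.

```r
df <- data.frame(col1 = c(1, 2, 3, 4, 30, 21, 320, 123, 4351, 1234, 3, 0, 43),
                 col2 = rep("something", 13))
myvector <- c(30, 43, 12, 333334, 14, 4351, 0, 5, 55, 66)

# The result of %in% always has the length of its LEFT operand.
by_row    <- df$col1 %in% myvector   # one value per row of df
by_vector <- myvector %in% df$col1   # one value per element of myvector

length(by_row)     # 13, the same as nrow(df)
length(by_vector)  # 10, the same as length(myvector)

# Suggested guard: only subset rows with a logical vector of matching length.
stopifnot(length(by_row) == nrow(df))
safe <- df[by_row, ]
```

With the guard in place, passing `by_vector` instead of `by_row` would stop with an error rather than silently returning the wrong rows.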

1 Answer

The comments already give the answer, but for a bit more detail, simply look at what each %in% call returns:

df$col1 %in% myvector
# [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE

This one is correct: it has one value per row of df, so subsetting keeps the rows where it is TRUE, namely rows 5, 9, 12 and 13.

versus

myvector %in% df$col1
# [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE

This one goes wrong: the result describes the 10 elements of myvector, not the rows of df. Used to subset df, it keeps rows 1, 2, 6 and 7, and because it has only 10 values it is recycled for rows 11, 12 and 13 as TRUE, TRUE, FALSE, so rows 11 and 12 end up in your subset as well.
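
To make the recycling visible, here is a small sketch with the same data. Using `rep_len()` to mimic what the row subsetting effectively does is my own illustration, and the names `idx`, `recycled` and `found` are hypothetical. It also shows what `myvector %in% df$col1` is actually suited for: subsetting `myvector` itself.

```r
df <- data.frame(col1 = c(1, 2, 3, 4, 30, 21, 320, 123, 4351, 1234, 3, 0, 43),
                 col2 = rep("something", 13))
myvector <- c(30, 43, 12, 333334, 14, 4351, 0, 5, 55, 66)

idx <- myvector %in% df$col1          # length 10, one value per element of myvector

# Mimic the recycling: stretch idx to the number of rows in df.
recycled <- rep_len(idx, nrow(df))
which(recycled)                       # 1 2 6 7 11 12, the rows seen in trial2

# idx lines up with myvector, so use it to subset myvector.
found <- myvector[idx]                # 30 43 4351 0

# Are some, or all, of the numbers in myvector present in df$col1?
any(idx)                              # TRUE
all(idx)                              # FALSE
```

So both directions of `%in%` are valid, but each result should only be used to index the object on its left-hand side.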
