Filter one table based on another in R

Question

I have one table(1) that looks like this (it is an all-by-all distance matrix transformed into a tab-separated list):

    sample1 sample2 405
    sample3 sample4 400
    sample5 sample6 1
    sample7 sample8 20
    sample1 sample3 40

I have another table(2) which contains those samples which meet certain criteria:

    sample1
    sample2
    sample8

How can I parse the first table(1) to extract only those rows in which both the variables in columns 1 and 2 can be found in table(2)?

i.e. the desired comparisons are only:

    sample1 sample2 405
    sample2 sample8 40
    sample8 sample1 100

The desired comparisons do not make any sense to me. Is that what you want as output, or are those values not accurate?
– MKR, Commented Dec 28, 2017 at 22:37
Sorry, the values are made up. I just want to filter table(1) so that it keeps the all-vs-all pairwise comparisons only for samples found in table(2).
– Sam Lipworth, Commented Dec 28, 2017 at 22:52

Len Greski · Accepted Answer · 2017-12-28 22:50:56Z

Here is a base R solution:

    rawData1 <- "first second distance
    sample1 sample2 405
    sample3 sample4 400
    sample5 sample6 1
    sample7 sample8 20
    sample1 sample3 40"

    rawData2 <- "sample
    sample1
    sample2
    sample8"

    a <- read.table(textConnection(rawData1), stringsAsFactors = FALSE, header = TRUE)
    b <- read.table(textConnection(rawData2), stringsAsFactors = FALSE, header = TRUE)

    a[a$first %in% b$sample & a$second %in% b$sample, ]

...and the output:

    > a[a$first %in% b$sample & a$second %in% b$sample, ]
        first  second distance
    1 sample1 sample2      405
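
The same test can be wrapped in a small helper so that it applies to any pair of ID columns. This is only a sketch: the function name `filter_pairs` and its arguments (`pairs`, `keep`, `cols`) are hypothetical and not part of the answer above, and the data frame is built directly rather than read from text.

```r
# Hypothetical helper: keep only rows whose two ID columns are BOTH in `keep`.
filter_pairs <- function(pairs, keep, cols = c("first", "second")) {
  in_first  <- pairs[[cols[1]]] %in% keep
  in_second <- pairs[[cols[2]]] %in% keep
  pairs[in_first & in_second, , drop = FALSE]
}

# Same example data as in the question.
a <- data.frame(
  first    = c("sample1", "sample3", "sample5", "sample7", "sample1"),
  second   = c("sample2", "sample4", "sample6", "sample8", "sample3"),
  distance = c(405, 400, 1, 20, 40),
  stringsAsFactors = FALSE
)
keep <- c("sample1", "sample2", "sample8")

result <- filter_pairs(a, keep)
print(result)
```

Because `%in%` only tests membership, the order of the two columns does not matter: a row with `sample8` first and `sample1` second would be kept as well.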

Mirabilis · Accepted Answer · 2017-12-28 22:47:46Z

I tried a similar set-up using a dataframe for your table(1) and a vector for your table(2).

    table_one <- data.frame(col_1 = c("a", "b", "c", "d"),
                            col_2 = c("b", "d", "f", "g"),
                            col_3 = c(1, 2, 3, 4))
    table_two <- c("b", "d")

When you set it up that way, something like this should work:

    library(tidyverse)

    table_one %>%
      filter(col_1 %in% table_two, col_2 %in% table_two)
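
For readers without the tidyverse installed, roughly the same filter can be written with base R's `subset()`. In `filter()`, comma-separated conditions are combined with `&`, which is why `&` appears here. This is a sketch on the same toy data; the name `kept` is only for illustration.

```r
# Same toy data as above, built without the tidyverse.
table_one <- data.frame(
  col_1 = c("a", "b", "c", "d"),
  col_2 = c("b", "d", "f", "g"),
  col_3 = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
table_two <- c("b", "d")

# Keep rows where BOTH col_1 and col_2 appear in table_two.
kept <- subset(table_one, col_1 %in% table_two & col_2 %in% table_two)
print(kept)
```

Only the row with `col_1 = "b"` and `col_2 = "d"` passes both tests.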

MKR · Accepted Answer · 2017-12-28 23:18:46Z

One option is to apply inner_join twice, once on the 1st column and once on the 2nd column, and then take the intersect of the two result sets.

    library(dplyr)

    df1 <- read.table(text = "Samp1 Samp2 Val
    sample1 sample2 405
    sample3 sample4 400
    sample5 sample6 1
    sample7 sample8 20
    sample1 sample3 40", header = TRUE, stringsAsFactors = FALSE)

    > df1
        Samp1   Samp2 Val
    1 sample1 sample2 405
    2 sample3 sample4 400
    3 sample5 sample6   1
    4 sample7 sample8  20
    5 sample1 sample3  40

    df2 <- data.frame(Samp = c("sample1", "sample2", "sample8"), stringsAsFactors = FALSE)

    > df2
         Samp
    1 sample1
    2 sample2
    3 sample8

    #use inner_join between Samp1 with Samp and then again Samp2 with Samp
    intersect(inner_join(df1, df2, by = c("Samp1" = "Samp")),
              inner_join(df1, df2, by = c("Samp2" = "Samp")))

The result will be:

        Samp1   Samp2 Val
    1 sample1 sample2 405

I think that the result set should only contain distances where both Samp1 AND Samp2 are in df2$Samp, which is different than what is produced by your code. If you use intersect() instead of union() your code produces the correct output.
@LenGreski You have pointed that out correctly. If both Samp1 and Samp2 should be in df2$Samp, then intersect() is needed.
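
The difference the comments describe can be seen without any joins. Assuming the rows of df1 are distinct, taking the union of the two joins corresponds to "either column matches" (`|`), while the intersect corresponds to "both columns match" (`&`). A base R sketch on the same data; the names `keep`, `in1`, `in2`, `either`, and `both` are only for illustration.

```r
df1 <- data.frame(
  Samp1 = c("sample1", "sample3", "sample5", "sample7", "sample1"),
  Samp2 = c("sample2", "sample4", "sample6", "sample8", "sample3"),
  Val   = c(405, 400, 1, 20, 40),
  stringsAsFactors = FALSE
)
keep <- c("sample1", "sample2", "sample8")

in1 <- df1$Samp1 %in% keep   # first column is in the allowed set
in2 <- df1$Samp2 %in% keep   # second column is in the allowed set

either <- df1[in1 | in2, ]   # union-like: at least one column matches
both   <- df1[in1 & in2, ]   # intersect-like: both columns match

print(either)
print(both)
```

Here `either` keeps three rows (Val 405, 20, and 40), while `both` keeps only the sample1/sample2 row, which is the filter the question asks for.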
