Match by id and divide column values across two dataframes

Question

frames:

df 1: contains multiple rows of the same id with 500 column values

 id|val.1|val.2|...|val.500 --------------------------------- 1 | 240 | 234 |...|228 1 | 224 | 222 |...|230 1 | 238 | 240 |...|240 2 | 277 | 270 |...|255 2 | 291 | 290 |...|265 2 | 284 | 282 |...|285

df 2: contains only one unique id (row) that matches df-1 id column with 500 column values

 id|val.1|val.2|...|val.500 --------------------------------- 1 | 250 | 240 |...|245 2 | 280 | 282 |...|281

I would like to divide df 1 column values by the corresponding column value in df 2 based on their id to end up with a df 3:

 id|val.1|val.2|...|val.500 --------------------------------- 1 | 0.96| 0.98|...|0.93 1 | 0.90| 0.93|...|0.94 1 | 0.95| 1.00|...|0.98 2 | 0.99| 0.96|...|0.91 2 | 1.04| 1.03|...|0.94 2 | 1.01| 1.00|...|1.01

Basically weighting df 1 values by df 2 based on their id and column value. I've been scratching my head for a while now about the best way to go about this, and not making much progress. any guidance would be greatly appreciated. Thanks

Community · Accepted Answer · 2017-05-23 12:01:53Z

Two possible approaches:

1: 'wide'-approach

With the dplyr and purrr packages:

library(dplyr) library(purrr) df12 <- left_join(df1, df2, by = 'id') cbind(id=df12[,1], map2_df(df12[,2:4], df12[,5:7], `/`))

With the data.table package (method borrowed from here):

library(data.table) # convert to 'data.tables' setDT(df1) setDT(df2) # creates two vectors of matching columnnames xcols = names(df1)[-1] icols = paste0("i.", xcols) # join and do the calculation df1[df2, on = 'id', Map('/', mget(xcols), mget(icols)), by = .EACHI]

which both give:

 id val.1 val.2 val.3 1: 1 0.9600000 0.9750000 0.9306122 2: 1 0.8960000 0.9250000 0.9387755 3: 1 0.9520000 1.0000000 0.9795918 4: 2 0.9892857 0.9574468 0.9074733 5: 2 1.0392857 1.0283688 0.9430605 6: 2 1.0142857 1.0000000 1.0142349

2: 'long'-approach

Another option is to reshape your dataframes into long format, then merge/join them and do the calculation.

With the data.table-package:

library(data.table) dt1 <- melt(setDT(df1), id = 1) dt2 <- melt(setDT(df2), id = 1) dt1[dt2, on = c('id','variable'), value := value/i.value][]

With the dplyr and tidyr packages:

library(dplyr) library(tidyr) df1 %>% gather(variable, value, -id) %>% left_join(., df2 %>% gather(variable, value, -id), by = c('id','variable')) %>% mutate(value = value.x/value.y) %>% select(id, variable, value)

which both give:

 id variable value 1: 1 val.1 0.9600000 2: 1 val.1 0.8960000 3: 1 val.1 0.9520000 4: 2 val.1 0.9892857 5: 2 val.1 1.0392857 6: 2 val.1 1.0142857 7: 1 val.2 0.9750000 8: 1 val.2 0.9250000 9: 1 val.2 1.0000000 10: 2 val.2 0.9574468 11: 2 val.2 1.0283688 12: 2 val.2 1.0000000 13: 1 val.3 0.9306122 14: 1 val.3 0.9387755 15: 1 val.3 0.9795918 16: 2 val.3 0.9074733 17: 2 val.3 0.9430605 18: 2 val.3 1.0142349

Used data:

df1 <- structure(list(id = c(1, 1, 1, 2, 2, 2), val.1 = c(240, 224, 238, 277, 291, 284), val.2 = c(234, 222, 240, 270, 290, 282), val.3 = c(228, 230, 240, 255, 265, 285)), .Names = c("id", "val.1", "val.2", "val.3"), class = "data.frame", row.names = c(NA, -6L)) df2 <- structure(list(id = c(1, 2), val.1 = c(250, 280), val.2 = c(240, 282), val.3 = c(245, 281)), .Names = c("id", "val.1", "val.2", "val.3"), class = "data.frame", row.names = c(NA, -2L))

Thanks Maximus! I used the data.table format as I could follow the code (barely). All seriousness, how do i pull my coding skills incrementally to 50% of yours?
@AnandRoopsind thx :-) Just keep on particpating on SO: read & run the code of good answers (start with the question of the r-faq tag). Try to solve questions: at the beginning this might be frustrating, but you will get better at it. Also see the info-page of the R-tag on which a lot of great resources are listed. With regard to data.table specifically, see the Getting started wiki on GitHub

lmo · Accepted Answer · 2017-01-03 21:17:19Z

As long as the data.frames are ordered properly by column and both have the same columns, then I think the following base R code will accomplish what you want.

cbind(df1[1], df1[-1] / df2[match(df1$id, df2$id), -1]) id val.1 val.2 val.500 1 1 0.9600000 0.9750000 0.9306122 2 1 0.8960000 0.9250000 0.9387755 3 1 0.9520000 1.0000000 0.9795918 4 2 0.9892857 0.9574468 0.9074733 5 2 1.0392857 1.0283688 0.9430605 6 2 1.0142857 1.0000000 1.0142349

Here, match(df1$id, df2$id) will return the row indices of df1 that correspond to the ids of df2, so df2[match(df1$id, df2$id), -1] will return the corresponding rows of df2 as a data.frame with the id variable removed. This data.frame then matches df1 in shape when the id variable is removed and df1[-1] / df2[match(df1$id, df2$id), -1] performs the division. Finally cbind prepends the id variable to the final data.frame.

data

df1 <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), val.1 = c(240L, 224L, 238L, 277L, 291L, 284L), val.2 = c(234L, 222L, 240L, 270L, 290L, 282L), val.500 = c(228L, 230L, 240L, 255L, 265L, 285L)), .Names = c("id", "val.1", "val.2", "val.500"), class = "data.frame", row.names = c(NA, -6L)) df2 <- structure(list(id = 1:2, val.1 = c(250L, 280L), val.2 = c(240L, 282L), val.500 = c(245L, 281L)), .Names = c("id", "val.1", "val.2", "val.500"), class = "data.frame", row.names = c(NA, -2L))

Collectives™ on Stack Overflow

Match by id and divide column values across two dataframes

2 Answers 2

3 Comments

Comments

Hot Network Questions