dplyr mutate - How to properly apply custom function with mutate?

Question

I am attempting to migrate a database and would like to use R to assist in the process. As part of the migration, I need to update "Item IDs" because they have changed. I have created a function to map the old IDs to the new ones:

old_to_new <- function(id, df) {
  return (df[which(df$Old == id), ]$New)
}

However, whenever I attempt to apply it to add a new column to my data frame (loaded from a database table):

library(tidyverse)
library(RODBC)

cn <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=xxx;database=xxx;uid=xxx;pwd=xxx;")
df <- sqlQuery(cn, "SELECT * FROM [MaintDB_New].[dbo].[Priority]")
ticket_df <- sqlQuery(cn, "SELECT * FROM [MaintDB_New].[dbo].[Tickets]")
ticket_details_df <- sqlQuery(cn, "SELECT * FROM [MaintDB_New].[dbo].[Ticket_Details]")
new_items <- read_csv("./ticket_itm_export_temp.csv", col_names = c("Old", "Name", "New"))

ticket_df_new <- ticket_df %>%
  mutate(item_id = old_to_new(itemID, new_items))

I receive the following error:

Error in `[[<-.data.frame`(`*tmp*`, col, value = c(NA_integer_, NA_integer_, :
  replacement has 280 rows, data has 69430
In addition: Warning message:
In df$Old == id :
  longer object length is not a multiple of shorter object length

What am I doing wrong, and what is the proper approach? I received a similar error while attempting to use ddplyr.

I am new to R, so I apologize if this is an obvious question.

EDIT - Added data structure:

head(ticket_df)
  ticketID propertyID itemID roomNumber assignedToID isOpen openID latestID
1       11         10      1       <NA>           NA      0     22       23
2       12         17      1       <NA>           NA      0     24      289
3       13         17      1       <NA>           NA      0     25      292
4       14         17     17       <NA>           NA      0     26     4411
5       15         17     68       <NA>           NA      0     27      296
6       16         17     74       <NA>           NA      0     28      294

head(new_items)
    Old Name                    New
  <int> <chr>                 <int>
1   257 Register Cash Drawers   425
2   253 Alarm System            426
3   135 CREDENZA/ ARMOIRE       427
4    55 Back Office PC          428
5   183 Backup All Data         429
6   260 Base Boards             430

Links to dput output: ticket_df and new_items

I'd suggest just doing a left_join, something like ticket_df %>% left_join(new_items, by = c("id" = "Old")) %>% mutate(item_id = New). Also make sure new_items doesn't have duplicate Old entries, or you'll end up with more rows than you started with. If this doesn't work, please post reproducible sample data so we can see what's going on. Use dput to give us a copy/pasteable version of the first 10 rows of the relevant data frames (looks like ticket_df and new_items are the relevant ones here).
– Gregor Thomas, Commented Oct 17, 2018 at 14:07
Thanks, Gregor. I do not believe that will work for me. I have added the outputs from head for the first data frames.
– KellyM, Commented Oct 17, 2018 at 14:16
Your data is still incomplete: there is nothing to tie things together between frames (no common "key"). (And please use dput.)
– r2evans, Commented Oct 17, 2018 at 14:19
One thing about reproducible questions: the representative data needs to indicate the behavior you're trying to demonstrate. In this case, there are no matches between ticket_df$itemID and new_items$Old, so any code we might work on will do nothing. (I was trying to infer "key" based on finding any columns with matches. Thank you for clarifying the underlying data structure, so we now just need a more representative data sample ... and please use dput.)
– r2evans, Commented Oct 17, 2018 at 14:25
I answered by changing your sample data (in the question) to give us matches. In general, many people eschew links (e.g., dropbox) in questions for two reasons: (1) hesitation to download random binary files due to the risk of viruses, etc.; (2) when those links go stale, the question is no longer reproducible. The goal is to keep as much as possible encapsulated within the question text. This suggests trimming sample data and code to the smallest necessary to demonstrate the problem.
– r2evans, Commented Oct 17, 2018 at 14:39

r2evans · Accepted Answer · 2018-10-17 15:45:42Z

I (really!) think Gregor's comment of left_joining makes a lot of sense. I'll force some matches by changing some of your values:

new_items$Old[1:2] <- c(17L,74L)

Now the join:

library(dplyr)
ticket_df %>%
  left_join(select(new_items, Old, New), by=c("itemID" = "Old"))
#   ticketID propertyID itemID roomNumber assignedToID isOpen openID latestID New
# 1       11         10      1         NA           NA      0     22       23  NA
# 2       12         17      1         NA           NA      0     24      289  NA
# 3       13         17      1         NA           NA      0     25      292  NA
# 4       14         17     17         NA           NA      0     26     4411 425
# 5       15         17     68         NA           NA      0     27      296  NA
# 6       16         17     74         NA           NA      0     28      294 426

If you're satisfied that this works, just reassign:

ticket_df %>%
  left_join(select(new_items, Old, New), by=c("itemID" = "Old")) %>%
  mutate(itemID = if_else(is.na(New), itemID, New)) %>%
  select(-New)
#   ticketID propertyID itemID roomNumber assignedToID isOpen openID latestID
# 1       11         10      1         NA           NA      0     22       23
# 2       12         17      1         NA           NA      0     24      289
# 3       13         17      1         NA           NA      0     25      292
# 4       14         17    425         NA           NA      0     26     4411
# 5       15         17     68         NA           NA      0     27      296
# 6       16         17    426         NA           NA      0     28      294

Alternatively you can use mutate(itemID = coalesce(New, itemID)), thanks @Gregor.
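As a quick check that the two forms agree, here is a minimal sketch on made-up vectors that mirror the itemID and New columns after the join (the values are illustrative, not the real data):

```r
library(dplyr)

# Illustrative vectors mirroring the joined columns above (not the real data).
itemID <- c(1L, 1L, 1L, 17L, 68L, 74L)
New    <- c(NA, NA, NA, 425L, NA, 426L)

# if_else keeps the old id wherever no new id was matched.
via_if_else <- if_else(is.na(New), itemID, New)

# coalesce takes the first non-missing value, element by element.
via_coalesce <- coalesce(New, itemID)

identical(via_if_else, via_coalesce)
```

Both calls need the two vectors to have compatible types (here both are integer), which is worth checking if the CSV and the database disagree about the column type.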

However, if you need to use a function (perhaps your problem is more complicated or you need something more generic), then a note:

Generally, functions used within mutate need to return a vector of length 1 or of the same length as the input they were given; this means subsetting (as you did with df[which(df$Old == id), ]$New) will often not work. (If you can guarantee that it will always return length 1 then it will not error, but I'm guessing that is not safe.) Similarly, summarize requires (I believe) functions to return length 1.
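To make the length issue concrete, here is a small base-R sketch on a made-up lookup table (the names lookup and ids are illustrative). It contrasts the subsetting approach, which compares the two vectors position by position with recycling, against match, which returns one position per input element:

```r
# Made-up miniature data, for illustration only.
lookup <- data.frame(Old = c(17L, 74L), New = c(425L, 426L))
ids    <- c(1L, 1L, 1L, 17L, 68L, 74L)

# `==` compares element by element, recycling the shorter vector;
# it does not look each id up in the table.
subset_result <- lookup[which(lookup$Old == ids), ]$New
length(subset_result)   # not the same length as `ids`

# match() returns, for every id, its position in lookup$Old (or NA),
# so the result always has the same length as `ids`.
positions <- match(ids, lookup$Old)
mapped    <- lookup$New[positions]
length(mapped) == length(ids)
```

Because the result of match lines up one-to-one with the input, it is a safe building block for anything called inside mutate.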

Here's one thought that is a little hasty but gets the same results:

myfunc <- function(id, changes) {
  # position of each id in the lookup table, NA where there is no match
  ind <- match(id, changes[["Old"]])
  indnonna <- !is.na(ind)
  # overwrite only the ids that have a replacement
  id[which(indnonna)] <- changes[["New"]][ind[indnonna]]
  id
}

ticket_df %>%
  mutate(newid = myfunc(itemID, new_items))
#   ticketID propertyID itemID roomNumber assignedToID isOpen openID latestID newid
# 1       11         10      1         NA           NA      0     22       23     1
# 2       12         17      1         NA           NA      0     24      289     1
# 3       13         17      1         NA           NA      0     25      292     1
# 4       14         17     17         NA           NA      0     26     4411   425
# 5       15         17     68         NA           NA      0     27      296    68
# 6       16         17     74         NA           NA      0     28      294   426

You can obviously just assign directly to itemID instead of a different column. I still discourage this, as (1) joins are much more efficient; (2) I'd want to work with the function a bit more to perhaps find a more robust method; and (3) it hard-codes the structure of new_items (i.e., specific column names) into the function, whereas doing a join allows you to specify at join time what happens, keeping the code immediately next to the structure-using elements.
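If a function is still needed, one possible way to soften point (3) is to pass the lookup columns in as arguments rather than naming them inside the function. The following is only a sketch: remap_ids and its from/to parameters are hypothetical names, and the data is a trimmed-down stand-in for the frames in the question.

```r
# Hypothetical helper: the caller supplies the lookup vectors explicitly,
# so no column names are hard-coded inside the function.
remap_ids <- function(id, from, to) {
  # Guard against a malformed lookup: unequal lengths or duplicate keys.
  stopifnot(length(from) == length(to),
            anyDuplicated(from) == 0)
  pos   <- match(id, from)       # position of each id in `from`, NA if absent
  found <- !is.na(pos)
  id[found] <- to[pos[found]]    # replace only the ids that have a mapping
  id
}

# Trimmed-down stand-ins for the data in the question.
new_items <- data.frame(Old = c(17L, 74L), New = c(425L, 426L))
ticket_df <- data.frame(ticketID = 11:16,
                        itemID   = c(1L, 1L, 1L, 17L, 68L, 74L))

ticket_df$itemID <- remap_ids(ticket_df$itemID, new_items$Old, new_items$New)
ticket_df$itemID
```

Inside a dplyr pipeline the same call would be mutate(itemID = remap_ids(itemID, new_items$Old, new_items$New)). The duplicate-key check mirrors the earlier warning about duplicate Old entries; the join remains the more efficient route.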

Wow, thanks so much for both a working answer and explanation. Wish I could give you more upvotes (and thanks to Gregor for the left_join recommendation).
Minor comment to a nice answer: dplyr ports the handy coalesce function from some SQL flavors to take the first non-missing value, simplifying if_else(is.na(New), itemID, New) to coalesce(New, itemID).
I was juggling a couple of other things and tried coalesce but it didn't work ... not sure why, it's working now, thanks @Gregor.
