2

I have a data frame in R called cryptdeltact that contains sample information as follows

# A tibble: 2,293 x 7 # Groups: Name [72] Name Detector N Value sd se ci <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> 1 VG 2H 1 SB1 C ATM 6 11.4 0.653 0.267 0.686 2 VG 2H 1 SB1 C BetaActin 6 0.0199 0.588 0.240 0.617 3 VG 2H 1 SB1 C BMPR1a 6 6.49 0.591 0.241 0.620 4 VG 2H 1 SB1 C BMPR2 6 7.19 0.614 0.251 0.645 5 VG 2H 1 SB1 C Brca1 6 11.5 0.640 0.261 0.672 6 VG 2H 1 SB1 C Brca2 6 11.9 0.840 0.343 0.882 7 VG 2H 1 SB1 C cmyc 6 8.20 0.580 0.237 0.608 8 VG 2H 1 SB1 C DNAPKCs 6 11.5 0.651 0.266 0.683 9 VG 2H 1 SB1 C Ercc1 6 11.4 0.783 0.320 0.822 10 VG 2H 1 SB1 C Fen1 6 9.05 0.629 0.257 0.660 # … with 2,283 more rows 

I want to add three new columns to this data frame: Model, Time and Region. All of the info for these new columns is contained somewhere in the existing 'Name' column. Time is the second piece of information in 'Name' ie. '0h', '2h' or '5h'. Region is the second last ie 'SB1', 'SB2', 'SB3' or 'SB4'. But Model is a combination of the first two letters and the last letter ie. 'VG C' or 'VG V' or 'WT C' or 'WT V'. I know the answer lies in extracting the appropriate info from the Name string and putting it into a new column but I am struggling with the syntax.

The final table columns would ideally look like this (once extracted I can change 'VG V' to 'VG Villus' and remove the Name column entirely)

 Model Time Region Detector N sd se ci <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 VG Villus 0 SB1 Fen1 1 NA NA NA 2 VG Villus 0 SB1 Lig3 1 NA NA NA 3 VG Villus 0 SB1 PARP1 1 NA NA NA 4 VG Villus 0 SB1 PolTheta 1 NA NA NA 5 VG Villus 0 SB1 WRN 1 NA NA NA 6 VG Villus 2 SB1 Fen1 3 1.22 0.706 3.04 7 VG Villus 2 SB1 Lig3 3 2.11 1.22 5.25 8 VG Villus 2 SB1 Mre11a 3 0.601 0.347 1.49 9 VG Villus 2 SB1 PARP1 3 1.94 1.12 4.82 10 VG Villus 2 SB1 PolTheta 3 2.74 1.58 6.82 

Apologies for the base question but I'm sure this could take up far less time than it currently is!

3
  • Not clear to me how you get the values. Can you explain how do you get values in 6th row for example ? 1.22 for sd, 0.706 for se and 3.04 for ci ? Commented Nov 11, 2019 at 12:55
  • We can try to do regex, but maybe you want to consider having another data.frame that annotates your name, and has the columns Model, Time and Region? Then you can simply left join these two Commented Nov 11, 2019 at 13:02
  • @RonakShah sorry i just used that data frame as an example of the column names I want - values are the orignal values after an additional step of processing. Commented Nov 11, 2019 at 13:22

2 Answers 2

1

This should work, and it just uses base R. Also, I am going to give you a bonus, and provide you with a numerical time variable (I think this is what you want?).

(assuming your data frame is called data)

#string split to create a list of all names split_col = strsplit(as.character(data$Name), " ") #create the lists for each new variable time_var = c() region_var = c() model_var = c() #create a counter for the for loop i = 1 #go through all the name strings for (s in split_col){ #add to the lists time_var[[i]] = s[2] region_var[[i]] = s[4] model_var[[i]] = paste(s[1], s[5]) #add to the counter i = i + 1 } #add these lists to the dataset data$model = model_var data$region = region_var data$time = time_var #make the time variable numeric data$time_numeric = ifelse(data$time == '2H', 2, ifelse(data$time == '5H', 5, ifelse(data$time == '0H', 0, NA))) 

Hope that works!

Sign up to request clarification or add additional context in comments.

Comments

1

We can use tidyr extract with appropriate regex and then unite the column

library(tidyr) extract(df, Name, into = c("Model", "Time", "Region", "temp"), regex = "(.*)(\\d)H.*(SB\\d).*([A-Z])$") %>% unite(Model, Model, temp, sep = "") # Model Time Region Detector N Value sd se ci #1 VG C 2 SB1 ATM 6 11.4000 0.653 0.267 0.686 #2 VG C 2 SB1 BetaActin 6 0.0199 0.588 0.240 0.617 #3 VG C 2 SB1 BMPR1a 6 6.4900 0.591 0.241 0.620 #4 VG C 2 SB1 BMPR2 6 7.1900 0.614 0.251 0.645 #5 VG C 2 SB1 Brca1 6 11.5000 0.640 0.261 0.672 #6 VG C 2 SB1 Brca2 6 11.9000 0.840 0.343 0.882 #7 VG C 2 SB1 cmyc 6 8.2000 0.580 0.237 0.608 #8 VG C 2 SB1 DNAPKCs 6 11.5000 0.651 0.266 0.683 #9 VG C 2 SB1 Ercc1 6 11.4000 0.783 0.320 0.822 #10 VG C 2 SB1 Fen1 6 9.0500 0.629 0.257 0.660 

data

df <- structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "VG 2H 1 SB1 C", class = "factor"), Detector = structure(1:10, .Label = c("ATM", "BetaActin", "BMPR1a", "BMPR2", "Brca1", "Brca2", "cmyc", "DNAPKCs", "Ercc1", "Fen1"), class = "factor"), N = c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), Value = c(11.4, 0.0199, 6.49, 7.19, 11.5, 11.9, 8.2, 11.5, 11.4, 9.05), sd = c(0.653, 0.588, 0.591, 0.614, 0.64, 0.84, 0.58, 0.651, 0.783, 0.629), se = c(0.267, 0.24, 0.241, 0.251, 0.261, 0.343, 0.237, 0.266, 0.32, 0.257), ci = c(0.686, 0.617, 0.62, 0.645, 0.672, 0.882, 0.608, 0.683, 0.822, 0.66)), class = "data.frame", row.names = c(NA, -10L)) 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.