added 1238 characters in body
Chris Ruehlemann

Explanation (as requested by OP):

The regular expression passed to extract's regex argument has two tasks:

  • (i) it must describe the string as a whole, from beginning to end
  • (ii) it must pick out those elements that should populate the newly created columns

Task (i) is achieved in that (\\w+\\s\\w+) captures the two words that make up the Supervisor name, while \\s describes (but does not capture) the following whitespace, and (.*) matches anything that follows that whitespace, i.e., in this case the four Advisor names.

Task (ii) is achieved by wrapping the Supervisor name and the Advisor names in capturing groups, written as parentheses; these parentheses are the 'syntax' by which the function extract 'realizes' that their content should go into the new columns.
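
To see the two capturing groups at work outside of extract, here is a minimal base-R sketch. It assumes the string has already been cleaned by the gsub step (the literal value of `x` is written out by hand for illustration), and it uses regexec/regmatches rather than tidyr, so it only mirrors what extract does with the groups:

```r
# Cleaned string, written out by hand for illustration (assumption:
# this is what one row looks like after the cleaning step)
x <- "Ali Ahmadi Aram Rezaeei Omid Saeedi Nima Shaki Sohrab Karimi"

# regexec() + regmatches() return the full match followed by
# one element per capturing group
m <- regmatches(x, regexec("(\\w+\\s\\w+)\\s(.*)", x))[[1]]

supervisor <- m[2]  # content of the first capturing group
advisors   <- m[3]  # content of the second capturing group

print(supervisor)
print(advisors)
```

The whitespace matched by the bare \\s between the two groups ends up in neither element, which is why it does not show up in either column.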

Finally, the commas are inserted between the Advisor names, again using a capturing group, whose content is recalled in gsub's replacement argument with the backreference \\1. The (?!$) expression is a negative lookahead asserting that the comma is inserted only if what follows the word boundary anchor \\b is not (hence the ! in the lookahead) the end of the string (expressed by $). Hope this helps!
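
A small sketch of what the lookahead buys you, comparing the pattern with and without (?!$). The input string is written out by hand and is assumed to be the content of the Advisor column before commas are added:

```r
# Advisor string before commas are inserted (hand-written for illustration)
advisors <- "Aram Rezaeei Omid Saeedi Nima Shaki Sohrab Karimi"

# With the negative lookahead: the last name pair is followed by the
# end of the string, so it is not matched and gets no comma
with_lookahead <- gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", advisors, perl = TRUE)

# Without the lookahead: every name pair matches, including the last one,
# which leaves a trailing comma
without_lookahead <- gsub("(\\w+\\s\\w+\\b)", "\\1,", advisors, perl = TRUE)

print(with_lookahead)
print(without_lookahead)
```

Note that perl = TRUE is needed here, since lookaheads are not supported by R's default regex engine.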

added 485 characters in body

Here's an attempt with extract:

    library(tidyr)
    library(dplyr)  # provides mutate() and %>%

    DF %>%
      # clean strings:
      mutate(Names = gsub("\\s?(Name|Family|First supervisor|Advisor|Type|\\d|\\s[,:])", "", Names, perl = TRUE)) %>%
      # extract data into columns:
      extract(Names, into = c("Supervisor", "Advisor"), regex = "(\\w+\\s\\w+)\\s(.*)") %>%
      # insert commas into `Advisor`:
      mutate(Advisor = gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", Advisor, perl = TRUE))

      Supervisor                                              Advisor
    1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
    2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
    3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi

Data:

    DF <- data.frame(Names = c(
      "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
      "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
      "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"
    ))


Here's an attempt with extract:

    library(tidyr)
    library(dplyr)  # provides mutate() and %>%

    DF %>%
      mutate(Names = gsub("\\s?(Name|Family|First supervisor|Advisor|Type|\\d|\\s[,:])", "", Names)) %>%
      extract(Names, into = c("Supervisor", "Advisor"), regex = "(\\w+\\s\\w+)\\s(.*)")