Revisions to How to extract specific words from a string with pattern in R

added 1238 characters in body

edited Jun 5, 2022 at 11:44

Chris Ruehlemann

Explanation (as requested by OP):

The regular expression passed to extract's regex argument is designed to do two tasks:

(i) it must describe the string as a whole, from beginning to end

(ii) it must pick out those elements that should populate the newly created columns

Task (i) is achieved in that (\\w+\\s\\w+) captures the two words that make up the Supervisor name, while \\s describes (but does not capture) the following whitespace and (.*) describes/matches anything that follows that whitespace - i.e., in this case the four Advisor names.

Task (ii) is achieved by wrapping the Supervisor name and the Advisor names in capturing groups given in parentheses; these parentheses are the 'syntax' by which the function extract 'realizes' that their content should go into the new columns.
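To make the role of the two capturing groups concrete, here is a minimal base R sketch. It is only an approximation of the idea, not how extract is implemented; the shortened input string and the variable names x, m and parts are assumptions made for illustration.

```r
# Hypothetical, already-cleaned input with one supervisor and two advisors
x <- "Ali Ahmadi Aram Rezaeei Omid Saeedi"

# regexec() returns the full match plus one entry per capturing group
m <- regmatches(x, regexec("(\\w+\\s\\w+)\\s(.*)", x))[[1]]

# m[1] is the full match; m[2] and m[3] are groups 1 and 2.
# The \\s between the groups is matched but ends up in neither group.
parts <- setNames(m[-1], c("Supervisor", "Advisor"))
print(parts)
```

Each capturing group corresponds to one name in into, in order; the whitespace matched outside the parentheses is simply dropped.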

Finally, the commas are inserted between the Advisor names, again using a capturing group, whose content can be recalled in gsub's replacement argument with the backreference \\1. The (?!$) expression is a negative lookahead asserting that the comma is inserted only if what follows the word boundary anchor \\b is not (hence the ! in the lookahead) the end of the string (expressed by $). Hope this helps!
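The effect of the lookahead is easiest to see by running the substitution with and without it. This is a small base R sketch on a standalone string; the variable names advisors, with_commas and trailing are made up for the example.

```r
advisors <- "Aram Rezaeei Omid Saeedi Nima Shaki Sohrab Karimi"

# With the negative lookahead: no comma after the last name pair
with_commas <- gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", advisors, perl = TRUE)
print(with_commas)

# Without the lookahead: every name pair, including the last, gets a comma
trailing <- gsub("(\\w+\\s\\w+\\b)", "\\1,", advisors, perl = TRUE)
print(trailing)
```

The \\b matters here as well: without it the regex engine could backtrack to a shorter match such as "Sohrab Karim", which is not at the end of the string, and insert a comma inside the last name.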

Data:

added 485 characters in body

edited Jun 5, 2022 at 7:15

Chris Ruehlemann

Here's an attempt with extract:

    library(tidyr)
    library(dplyr)  # provides mutate()

    DF %>%
      # clean strings:
      mutate(Names = gsub("\\s?(Name|Family|First supervisor|Advisor|Type|\\d|\\s[,:])", "", Names, perl = TRUE)) %>%
      # extract data into columns:
      extract(Names,
              into = c("Supervisor", "Advisor"),
              regex = "(\\w+\\s\\w+)\\s(.*)") %>%
      # insert commas into `Advisor`:
      mutate(Advisor = gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", Advisor, perl = TRUE))

      Supervisor                                              Advisor
    1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
    2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
    3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi

Data:

    DF <- data.frame(Names = c(
      "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
      "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
      "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))
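To see what the cleaning step does by itself, the same gsub call can be run on a single shortened string in base R. The string raw below is a hypothetical, truncated version of one row of the data, and cleaned is just an illustrative name.

```r
# Hypothetical shortened input: one supervisor and one numbered advisor
raw <- "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor 1"

# Remove the labels, the digits and the " ," / " :" separators,
# each together with one optional preceding whitespace character
cleaned <- gsub("\\s?(Name|Family|First supervisor|Advisor|Type|\\d|\\s[,:])",
                "", raw, perl = TRUE)
print(cleaned)
```

Only the names survive, separated by single spaces. A leading space is left over at the start of the string; it appears not to matter for the next step, since the pattern given to extract begins with \\w+ and so starts matching at the first word character.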

answered Jun 5, 2022 at 7:08

Chris Ruehlemann

Here's an attempt with extract:

    library(tidyr)
    library(dplyr)  # provides mutate()

    DF %>%
      mutate(Names = gsub("\\s?(Name|Family|First supervisor|Advisor|Type|\\d|\\s[,:])", "", Names)) %>%
      extract(Names,
              into = c("Supervisor", "Advisor"),
              regex = "(\\w+\\s\\w+)\\s(.*)")