How to extract specific words from a string with pattern in R

Question

I have a data frame that contains the names of the supervisors and advisors of students' dissertations in a faculty, for example:

DF <- data.frame(Names = c(
  "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
  "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
  "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"
))

I want to separate the supervisors and the advisors into two distinct columns. My expected result looks like this:

DF1 <- data.frame(
  Supervisor = c("Ali Ahmadi", "Ali Ahmadi", "Ali Ahmadi"),
  Advisors = c(
    "Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi",
    "Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi",
    "Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi"
  )
)
DF1
  Supervisor                                             Advisors
1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi

I tried the following code:

DF1 <- strsplit(DF$Names, "Name :")
stopwords = c(":", "Type", "Family", "Name", "1", "2", "3", "Advisor", "Family")
DF2 <- lapply(DF1, function(x) unlist(strsplit(x, " ")))
DF3 <- lapply(DF2, function(x) x[!x %in% stopwords])
DF4 <- lapply(DF3, function(x) paste(x, collapse = " "))

But the final result, shown below, is not what I expected, and it apparently needs further work before it can be converted to a data frame:

DF4
[[1]]
[1] " Ali , Ahmadi , First supervisor Aram , Rezaeei , Omid , Saeedi , Nima , Shaki , Sohrab , Karimi ,"

[[2]]
[1] " Ali , Ahmadi , First supervisor Aram , Rezaeei , Omid , Saeedi , Nima , Shaki , Sohrab , Karimi ,"

[[3]]
[1] " Ali , Ahmadi , First supervisor Aram , Rezaeei , Omid , Saeedi , Nima , Shaki , Sohrab , Karimi ,"

Is there a simpler way to solve this problem? I found that regular expressions can be helpful, but I don't know how to use them, at least in the case of my example. Thanks in advance for any answer.

Chris Ruehlemann · Accepted Answer · 2022-06-05 11:44:08Z

Here's an attempt with extract:

library(tidyr)
library(dplyr)  # needed for mutate(), which tidyr does not provide

DF %>%
  # clean strings:
  mutate(Names = gsub("\\s?(Name|Family|First supervisor|Advisor|Type|\\d|\\s[,:])", "", Names, perl = TRUE)) %>%
  # extract data into columns:
  extract(Names, into = c("Supervisor", "Advisor"), regex = "(\\w+\\s\\w+)\\s(.*)") %>%
  # insert commas into `Advisor`:
  mutate(Advisor = gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", Advisor, perl = TRUE))
  Supervisor                                              Advisor
1 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
2 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi
3 Ali Ahmadi Aram Rezaeei, Omid Saeedi, Nima Shaki, Sohrab Karimi

Explanation (as requested by OP):

The regular expression passed to extract's regex argument is designed to do two tasks:

(i) it must describe the string as a whole, from beginning to end
(ii) it must pick out those elements that should populate the newly created columns

Task (i) is achieved in that (\\w+\\s\\w+) captures the two words that make up the Supervisor name, while \\s describes (but does not capture) the following whitespace, and (.*) matches anything that follows that whitespace, i.e., in this case, the four Advisor names.

Task (ii) is achieved by wrapping the Supervisor name and the Advisor names in capturing groups, written in parentheses; these parentheses are the 'syntax' by which the function extract 'realizes' that their content should go into the new columns.
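To see the two capturing groups in isolation, here is a minimal base R sketch that applies the same pattern with regexec and regmatches instead of extract. It assumes the cleaned string has the form shown in `cleaned` (one row after the first gsub step, leading space included); the variable names are illustrative only.

```r
# Assumed form of one cleaned row (after the stop words were removed)
cleaned <- " Ali Ahmadi Aram Rezaeei Omid Saeedi Nima Shaki Sohrab Karimi"

# regexec() locates the full match and each capturing group;
# regmatches() returns them as a character vector:
#   element 1 = full match, element 2 = group 1, element 3 = group 2
m <- regmatches(
  cleaned,
  regexec("(\\w+\\s\\w+)\\s(.*)", cleaned, perl = TRUE)
)[[1]]

supervisor <- m[2]  # content of the first pair of parentheses
advisors   <- m[3]  # content of the second pair of parentheses

supervisor
advisors
```

The whitespace matched by the bare \\s between the two groups ends up in neither variable, which is the "describes but does not capture" behaviour mentioned above.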

Finally, the commas are inserted between the Advisor names, again using a capturing group, which can be recalled in gsub's replacement argument with a backreference (\\1). The (?!$) expression is a negative lookahead asserting that the comma is inserted only if what follows the word boundary anchor \\b is not (hence the ! in the lookahead) the end of the string (expressed by $). Hope this helps!
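The effect of the lookahead is easiest to see by running the replacement with and without it. This is a small sketch on a single advisor string; the variable names are illustrative only.

```r
advisors <- "Aram Rezaeei Omid Saeedi Nima Shaki Sohrab Karimi"

# With the negative lookahead: the last name pair is followed by the end
# of the string, so it is not matched and gets no comma
with_lookahead <- gsub("(\\w+\\s\\w+\\b)(?!$)", "\\1,", advisors, perl = TRUE)

# Without the lookahead: every name pair matches, including the last one,
# which leaves a trailing comma
without_lookahead <- gsub("(\\w+\\s\\w+\\b)", "\\1,", advisors, perl = TRUE)

with_lookahead
without_lookahead
```

Note that lookaheads are a PCRE feature, so perl = TRUE is required for the first pattern.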

Data:

DF<-data.frame(Names=c("Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3", "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3", "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"))

Rui Barradas · Accepted Answer · 2022-06-05 07:33:43Z

Here is a base R solution.

DF <- data.frame(Names=c(
  "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
  "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3",
  "Name : Ali , Family : Ahmadi , Type : First supervisor Name : Aram , Family : Rezaeei , Type : Advisor Name : Omid , Family : Saeedi , Type : Advisor 1 Name : Nima , Family : Shaki , Type : Advisor 2 Name : Sohrab , Family : Karimi , Type : Advisor 3"
))

stopwords <- c(":","Type","Family","Name","1","2", "3", "Advisor", "Family")
stoppattern <- paste(stopwords, collapse = "|")

DF1 <- strsplit(DF$Names, "Name :")
DF1 <- lapply(DF1, \(x) trimws(x[sapply(x, nchar) > 0L]))
DF2 <- lapply(DF1, \(x) {
  gsub(stoppattern, "", x)
})
DF3 <- lapply(DF2, \(x) {
  y <- gsub(stoppattern, "", x)
  y <- strsplit(x, ",")
  y <- lapply(y, trimws)
  lapply(y, \(.y) {
    .y <- trimws(.y)
    .y[sapply(.y, nchar) > 0L]
  })
})
DF4 <- lapply(DF3, \(x) {
  Supervisor <- x[[1]][1:2]
  Supervisor <- paste(trimws(Supervisor), collapse = " ")
  Advisors <- unlist(x[-1])
  Advisors <- paste(trimws(Advisors), collapse = ", ")
  data.frame(Supervisor, Advisors)
})
Final <- do.call(rbind, DF4)
Final
#>   Supervisor                                                 Advisors
#> 1 Ali Ahmadi Aram, Rezaeei, Omid, Saeedi, Nima, Shaki, Sohrab, Karimi
#> 2 Ali Ahmadi Aram, Rezaeei, Omid, Saeedi, Nima, Shaki, Sohrab, Karimi
#> 3 Ali Ahmadi Aram, Rezaeei, Omid, Saeedi, Nima, Shaki, Sohrab, Karimi

Created on 2022-06-05 by the reprex package (v2.0.1)
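In the printed result, every word in Advisors is separated by a comma, whereas the question asks for full names such as "Aram Rezaeei". One possible adjustment, sketched here and not part of the answer's code, is to pair the words before collapsing them. It assumes that every advisor has exactly one given name and one family name; `words` stands in for the vector that `unlist(x[-1])` produces inside DF4.

```r
# Stand-in for unlist(x[-1]): given and family names in alternating order
words <- c("Aram", "Rezaeei", "Omid", "Saeedi", "Nima", "Shaki", "Sohrab", "Karimi")

# matrix() fills column by column, so each column holds one person:
# row 1 = given names, row 2 = family names
pairs <- matrix(words, nrow = 2)

# Paste the two rows element-wise, then join the full names with ", "
full_names <- paste(pairs[1, ], pairs[2, ])
advisors <- toString(full_names)

advisors
```

This breaks as soon as a name consists of more or fewer than two words, so it is only suitable for data as regular as the example.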

hello_friend · Accepted Answer · 2022-06-05 12:21:47Z

Messy Base R:

# Store a vector of names: ir_names => character vector
ir_names <- c("Name", "Family", "Type")

# Compute its length: ir_name_len => integer scalar
ir_name_len <- length(ir_names)

# Compute the desired result: res => data.frame
res <- do.call(
  rbind,
  lapply(
    strsplit(
      DF$Names,
      "Name\\s+\\:\\s+"
    ),
    function(x){
      y <- data.frame(tmp = unlist(strsplit(x, " , ")))
      ir1 <- setNames(
        data.frame(
          do.call(
            rbind,
            lapply(
              split(
                y,
                ceiling(seq_len(nrow(y))/ir_name_len)
              ),
              t
            )
          ),
          row.names = NULL,
          stringsAsFactors = FALSE
        ),
        ir_names
      )
      ir2 <- transform(
        ir1,
        Name = trimws(paste(Name, gsub("Family\\s+\\:\\s+", "", Family))),
        Type = trimws(gsub("Type\\s+\\:\\s+", "", Type))
      )[,c("Name", "Type")]
      ir3 <- data.frame(
        Supervisor = ir2$Name[which(grepl("supervisor", ir2$Type))],
        Advisor = toString(ir2$Name[-which(grepl("supervisor", ir2$Type))]),
        stringsAsFactors = FALSE,
        row.names = NULL
      )
    }
  )
)

# Print to console: data.frame => stdout(console)
res
