R's grepl() to find multiple strings exists [duplicate]

Question

grepl("instance|percentage", labelTest$Text)

returns TRUE if either "instance" or "percentage" is present.

How can I get TRUE only when both terms are present?

How about running grep once with "instance" and once with "percentage", then combining the results (as TRUE or FALSE)?
– amonk, Commented May 24, 2017 at 8:28
I need to populate an Excel sheet with this combination. My code is: labelTest$label[ grep("instance", labelTest$Text)] <- "combination1", so one call with "instance" and another with "percentage" won't work.
– toofrellik, Commented May 24, 2017 at 8:32
labelTest$label[ grep("instance", labelTest$Text) & grep("percentage", labelTest$Text)] <- "combination1" is what @agerom was suggesting and should work.
– FlorianGD, Commented May 24, 2017 at 8:40
The above doesn't work: it behaves like the | operator and also gives this warning: longer object length is not a multiple of shorter object length
– toofrellik, Commented May 24, 2017 at 8:43
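
The warning in the last comment comes from grep() returning the integer positions of the matching elements rather than one TRUE/FALSE per element. Combining two position vectors with & coerces every non-zero position to TRUE and recycles the shorter vector. A minimal sketch, assuming a hypothetical vector `txt` standing in for labelTest$Text:

```r
# Hypothetical stand-in for labelTest$Text (not from the original post)
txt <- c("instance only", "percentage only", "instance and percentage",
         "neither", "instance again")

# grep() returns the positions of the matching elements
idx_inst <- grep("instance", txt)    # 1 3 5
idx_perc <- grep("percentage", txt)  # 2 3

# `&` on positions treats each non-zero index as TRUE and recycles the
# shorter vector, so the result does not identify elements matching both
bad <- suppressWarnings(idx_inst & idx_perc)  # TRUE TRUE TRUE

# grepl() returns one logical per element, so `&` works element-wise
both <- grepl("instance", txt) & grepl("percentage", txt)
print(both)
# FALSE FALSE TRUE FALSE FALSE
```

Using grepl() in place of grep() is what makes the combined condition usable as an index.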

AkselA · Accepted Answer · 2019-10-30 18:16:27Z

    Text <- c("instance", "percentage", "n", "instance percentage", "percentage instance")

    grepl("instance|percentage", Text)
    # TRUE TRUE FALSE TRUE TRUE

    grepl("instance.*percentage|percentage.*instance", Text)
    # FALSE FALSE FALSE TRUE TRUE

The latter one works by looking for:

('instance')(any character sequence)('percentage') OR ('percentage')(any character sequence)('instance')

Naturally, if you need to match every combination of more than two words, this gets complicated quickly, because each ordering of the words needs its own alternative. In that case the solution mentioned in the comments is easier to implement and read.

Another alternative that might be relevant when matching many words is positive look-ahead, which can be thought of as a 'non-consuming' match. This requires perl=TRUE, which switches grepl() to Perl-compatible regular expressions.

    # create a vector of word combinations
    set.seed(1)
    words <- c("instance", "percentage", "element", "character", "n", "o", "p")
    Text2 <- replicate(10, paste(sample(words, 5), collapse=" "))

    # grepl with multiple positive look-ahead
    longperl <- grepl("(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)", Text2, perl=TRUE)

    # this is equivalent to the solution proposed in the comments
    longstrd <- grepl("instance", Text2) & grepl("percentage", Text2) &
                grepl("element", Text2) & grepl("character", Text2)

    # they produce identical results
    identical(longperl, longstrd)
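
To see why chained look-aheads behave like a logical AND, here is a small sketch on the five-element vector from the first example, redefined so the snippet is self-contained. Each (?=.*word) only asserts that the word occurs somewhere ahead and consumes no characters, so the order of the words in the text does not matter.

```r
Text <- c("instance", "percentage", "n", "instance percentage", "percentage instance")

# Both look-aheads are checked from the same position; because neither
# consumes input, the match succeeds only if both words are present.
res <- grepl("(?=.*instance)(?=.*percentage)", Text, perl = TRUE)
print(res)
# FALSE FALSE FALSE TRUE TRUE
```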

Furthermore, if you have the patterns stored in a vector, you can condense the expressions significantly:

    pat <- c("instance", "percentage", "element", "character")

    longperl <- grepl(paste0("(?=.*", pat, ")", collapse=""), Text2, perl=TRUE)
    longstrd <- rowSums(sapply(pat, grepl, Text2) - 1L) == 0L
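
It may help to look at the string that paste0() actually builds. Note also that the rowSums() version depends on sapply() returning a matrix, which, as far as I can tell, it does not do when the text vector has length one. The sketch below shows one possible alternative based on Reduce(); the helper name `all_present` and the vector `x` are made up for illustration.

```r
pat <- c("instance", "percentage", "element", "character")

# The collapsed look-ahead pattern is a single string
lookahead <- paste0("(?=.*", pat, ")", collapse = "")
print(lookahead)
# "(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)"

# Hypothetical helper: AND together one grepl() result per pattern.
# Reduce() folds a list of logical vectors, so it does not rely on
# sapply() simplifying its result to a matrix.
all_present <- function(patterns, x) {
  Reduce(`&`, lapply(patterns, grepl, x = x))
}

x <- c("character element instance percentage", "instance percentage")
print(all_present(pat, x))
# TRUE FALSE

# Also works for a single string
print(all_present(pat, x[1]))
# TRUE
```

Both variants should agree with the look-ahead pattern; the Reduce() form simply trades compactness for fewer assumptions about the shape of the input.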

As asked for in the comments, if you want to match exact words only, i.e. not match substrings, you can specify word boundaries using \\b. For example:

    tx <- c("cent element", "percentage element", "element cent", "element centimetre")

    grepl("(?=.*\\bcent\\b)(?=.*element)", tx, perl=TRUE)
    # TRUE FALSE TRUE FALSE

    grepl("element", tx) & grepl("\\bcent\\b", tx)
    # TRUE FALSE TRUE FALSE

I'm not sure this works with whole words. For example, replacing "instance" with "table" also seems to capture cases like "marketable". I tried "\\stable" to require a space before "table", but that doesn't work either. Any suggestions?
@val: If you use \\b instead to indicate a word boundary, it should work.
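
A short sketch of the case raised in this comment, using a made-up vector `tx2`. It compares the plain pattern, the \\s attempt, and the \\b version; the \\s variant misses "table" at the start of a string because no whitespace character precedes it.

```r
# Hypothetical test strings
tx2 <- c("table", "marketable", "a table here", "tables")

# Without boundaries, "table" also matches inside longer words
plain <- grepl("table", tx2)
print(plain)
# TRUE TRUE TRUE TRUE

# Requiring a whitespace character before "table" misses the first element
ws <- grepl("\\stable", tx2)
print(ws)
# FALSE FALSE TRUE FALSE

# \\b on both sides restricts the match to the whole word
whole <- grepl("\\btable\\b", tx2)
print(whole)
# TRUE FALSE TRUE FALSE
```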

Sebastian Geschonke · Accepted Answer · 2020-01-07 15:13:01Z

This returns TRUE only for those items of the vector labelTest$Text in which both terms occur. I think it answers the question exactly and is much shorter than the other solutions.

grepl("instance",labelTest$Text) & grepl("percentage",labelTest$Text)
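
Because the result is a logical vector with one entry per row, it can be used directly as an index for the assignment mentioned in the question's comments. A minimal sketch, assuming a hypothetical `labelTest` data frame; only the column names are taken from the question.

```r
# Hypothetical data; only the column names Text and label come from the question
labelTest <- data.frame(
  Text = c("instance count", "percentage of instance", "percentage", "other"),
  label = NA_character_,
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row: TRUE only where both terms occur
both <- grepl("instance", labelTest$Text) & grepl("percentage", labelTest$Text)

# Label only the rows that contain both terms
labelTest$label[both] <- "combination1"
print(labelTest$label)
# NA "combination1" NA NA
```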

answered Jan 6, 2020 at 14:13 · edited Jan 7, 2020 at 15:13

3 Comments

N. berouain Over a year ago

please clarify your answer

Sebastian Geschonke Over a year ago

Please clarify your confusion

N. berouain Over a year ago

Code-only answers are not encouraged, please next time add some context to explain how this answer will solve the problem in question.

Das_Geek · Accepted Answer · 2020-01-06 15:30:06Z

Use intersect and feed it a grep for each word:

    library(data.table) # used for subsetting text vector below

    vector_of_text[
      intersect(
        grep(vector_of_text, pattern = "pattern1"),
        grep(vector_of_text, pattern = "pattern2")
      )
    ]
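
A small base R sketch of the same idea on made-up data, to show what intersect() receives and returns. The contents of `vector_of_text` and the two patterns are assumptions for illustration.

```r
# Hypothetical example data
vector_of_text <- c("instance", "percentage", "instance percentage", "n")

# grep() returns positions; intersect() keeps positions found by both calls
idx <- intersect(grep("instance", vector_of_text),
                 grep("percentage", vector_of_text))
print(idx)
# 3
print(vector_of_text[idx])
# "instance percentage"

# The same positions via the logical approach
same <- identical(
  idx,
  which(grepl("instance", vector_of_text) & grepl("percentage", vector_of_text))
)
print(same)
# TRUE
```

Plain `[` indexing on a character vector is enough for the subsetting step here.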

answered Nov 15, 2018 at 13:53 by the earthling · edited Jan 6, 2020 at 15:30 by Das_Geek

1 Comment

Rick Pack Over a year ago

I am not seeing a use of data.table in here. Can you clarify? Also, I think you want: vector_of_text[ grepl(vector_of_text , pattern = "pattern1") & grepl(vector_of_text , pattern = "pattern2") ]. That uses & on logical vectors instead of intersect(), but we still have the potential problem that a hit will include strings that merely contain the search term (like "instance1" for "instance").
