23
grepl("instance|percentage", labelTest$Text) 

will return TRUE if either "instance" or "percentage" is present.

How can I get TRUE only when both terms are present?

5
  • How about running grep once with "instance" and once with "percentage", then combining the results (as TRUE or FALSE)? Commented May 24, 2017 at 8:28
  • See for example stackoverflow.com/questions/43803561 Commented May 24, 2017 at 8:28
  • I need to populate an Excel file with this combination. My code is: labelTest$label[ grep("instance", labelTest$Text)] <- "combination1", so running one grep with "instance" and another with "percentage" won't work. Commented May 24, 2017 at 8:32
  • 1
    labelTest$label[ grep("instance", labelTest$Text) & grep("percentage", labelTest$Text)] <- "combination1" is what @agerom was suggesting and should work Commented May 24, 2017 at 8:40
  • The above doesn't work: it behaves like the | operator and also gives this warning: longer object length is not a multiple of shorter object length Commented May 24, 2017 at 8:43
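
The behaviour reported in the last comment follows from what grep() returns: integer positions of the matching elements, not one TRUE/FALSE per element. Combining two position vectors with & compares non-zero integers and recycles the shorter vector when their lengths differ, which is where the warning can come from. A minimal sketch, assuming a made-up vector txt in place of labelTest$Text:

```r
# Hypothetical example data standing in for labelTest$Text
txt <- c("instance", "percentage", "n", "instance percentage", "percentage instance")

# grep() returns the positions of the matching elements
idx_instance   <- grep("instance", txt)    # 1 4 5
idx_percentage <- grep("percentage", txt)  # 2 4 5

# & on position vectors only checks that the integers are non-zero,
# so every entry is TRUE and nothing is filtered
wrong <- idx_instance & idx_percentage

# grepl() returns one TRUE/FALSE per element, which & combines correctly
both <- grepl("instance", txt) & grepl("percentage", txt)
which(both)  # 4 5
```

Indexing with both (or which(both)) then selects only the elements that contain both terms.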

3 Answers

34
Text <- c("instance", "percentage", "n", "instance percentage", "percentage instance")
grepl("instance|percentage", Text)
# TRUE TRUE FALSE TRUE TRUE
grepl("instance.*percentage|percentage.*instance", Text)
# FALSE FALSE FALSE TRUE TRUE

The latter one works by looking for:

('instance')(any character sequence)('percentage') OR ('percentage')(any character sequence)('instance') 

Naturally if you need to find any combination of more than two words, this will get pretty complicated. Then the solution mentioned in the comments would be easier to implement and read.

Another alternative that might be relevant when matching many words is to use positive look-ahead (which can be thought of as a 'non-consuming' match). For this you have to enable Perl-compatible regular expressions with perl=TRUE.

# create a vector of word combinations
set.seed(1)
words <- c("instance", "percentage", "element", "character", "n", "o", "p")
Text2 <- replicate(10, paste(sample(words, 5), collapse=" "))

# grepl with multiple positive look-ahead
longperl <- grepl("(?=.*instance)(?=.*percentage)(?=.*element)(?=.*character)", Text2, perl=TRUE)

# this is equivalent to the solution proposed in the comments
longstrd <- grepl("instance", Text2) & grepl("percentage", Text2) & grepl("element", Text2) & grepl("character", Text2)

# they produce identical results
identical(longperl, longstrd)

Furthermore, if you have the patterns stored in a vector you can condense the expressions significantly, giving you

pat <- c("instance", "percentage", "element", "character")
longperl <- grepl(paste0("(?=.*", pat, ")", collapse=""), Text2, perl=TRUE)
longstrd <- rowSums(sapply(pat, grepl, Text2) - 1L) == 0L
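
If the same check is needed in several places, the vectorised version can be wrapped in a small function. The following is only a sketch: contains_all is a hypothetical helper name, not part of base R, and it combines one grepl() result per pattern with & instead of using rowSums().

```r
# Hypothetical helper: TRUE where an element of x matches every pattern in pats
contains_all <- function(x, pats) {
  # one logical vector per pattern (each pattern is passed as grepl's first argument)
  hits <- lapply(pats, grepl, x = x)
  # combine them element-wise with &, starting from all TRUE
  Reduce(`&`, hits, rep(TRUE, length(x)))
}

Text <- c("instance", "percentage", "n", "instance percentage", "percentage instance")
contains_all(Text, c("instance", "percentage"))
# FALSE FALSE FALSE TRUE TRUE
```

Starting the reduction from an all-TRUE vector means an empty pattern vector matches everything, which may or may not be the behaviour you want.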

As asked for in the comments, if you want to match on exact words, i.e. not match on substrings, we can specify word boundaries using \\b. E.g:

tx <- c("cent element", "percentage element", "element cent", "element centimetre")
grepl("(?=.*\\bcent\\b)(?=.*element)", tx, perl=TRUE)
# TRUE FALSE TRUE FALSE
grepl("element", tx) & grepl("\\bcent\\b", tx)
# TRUE FALSE TRUE FALSE

2 Comments

not sure this works with whole words. For example, replacing "instance" with "table" also seems to capture cases like "marketable". I tried adding "\\stable" to include a space before "table" but that doesn't work either. Any suggestions?
@val: If you use \\b instead to indicate a word boundary, it should work.
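
To illustrate the "table" versus "marketable" case from the comment above, here is a small sketch with made-up strings. Note that a pattern such as "\\stable" requires a whitespace character before the word, so it would miss "table" at the very start of a string; \\b does not have that limitation.

```r
# Made-up strings for the "table" case raised in the comment
tx <- c("a table of percentages", "marketable percentage",
        "percentage table", "tables only")

# Without word boundaries, "table" also matches inside "marketable"
loose <- grepl("table", tx) & grepl("percentage", tx)
loose
# TRUE TRUE TRUE FALSE

# With \\b on both sides, only the whole word "table" counts
strict <- grepl("\\btable\\b", tx) & grepl("percentage", tx)
strict
# TRUE FALSE TRUE FALSE
```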
4

This returns TRUE only for elements of the vector labelTest$Text in which both terms occur. I think this is the exact answer to the question, and it is much shorter than the other solutions.

grepl("instance",labelTest$Text) & grepl("percentage",labelTest$Text) 
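
Because the result is a logical vector with one entry per row, it can be used directly for the label assignment described in the question's comments. A minimal sketch, assuming a made-up data frame in place of the asker's labelTest:

```r
# Hypothetical stand-in for the asker's data frame
labelTest <- data.frame(
  Text = c("instance", "percentage", "n", "instance percentage", "percentage instance"),
  label = NA_character_,
  stringsAsFactors = FALSE
)

# TRUE only for rows whose Text contains both terms
both <- grepl("instance", labelTest$Text) & grepl("percentage", labelTest$Text)

# Label only those rows; the others keep NA
labelTest$label[both] <- "combination1"
labelTest$label
# NA NA NA "combination1" "combination1"
```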

3 Comments

please clarify your answer
Please clarify your confusion
Code-only answers are not encouraged, please next time add some context to explain how this answer will solve the problem in question.
0

Use intersect and feed it a grep for each word:

library(data.table) #used for subsetting text vector below
vector_of_text[
  intersect(
    grep(vector_of_text , pattern = "pattern1"),
    grep(vector_of_text , pattern = "pattern2")
  )
]

1 Comment

I don't see data.table being used here. Can you clarify? Also, I think you want: vector_of_text[ grepl(vector_of_text , pattern = "pattern1") & grepl(vector_of_text , pattern = "pattern2") ], with grepl() rather than grep() so that & combines logical vectors, and without intersect(). Either way, we still have the potential problem that a hit will include strings that merely contain the search term (like "instance1" for "instance").
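
The intersect() approach and the logical & approach select the same elements; the difference is that grep() works with positions while grepl() works with TRUE/FALSE values. A minimal base R sketch (no data.table needed), assuming a made-up vector_of_text and the two terms from the question as patterns:

```r
# Made-up example vector
vector_of_text <- c("instance", "percentage", "n",
                    "instance percentage", "percentage instance")

# Positions that appear in both grep() results
by_index <- intersect(grep("instance", vector_of_text),
                      grep("percentage", vector_of_text))

# Positions where both grepl() results are TRUE
by_logical <- which(grepl("instance", vector_of_text) &
                    grepl("percentage", vector_of_text))

# Both select the same elements
isTRUE(all.equal(by_index, by_logical))
vector_of_text[by_index]
```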