Grep in R to find words with custom "extended" boundaries

Question

I'm looking for a regular expression to grep whole words, including words separated by digits or underscore. \\b considers digits and underscore as parts of words, not as boundaries.

For example, I'd like to catch MOUSE in "DOG MOUSE CAT", in "DOG MOUSE:CAT" but also in "DOG_MOUSE9CAT" and at the end or the beginning of an expression, as in "MOUSE9CAT" and "DOG_MOUSE". Basically, the boundary I'm looking for is any non-uppercase-alpha character plus beginning and end of line/expression (maybe missing some other cases caught by \\b here).

I've tried:

"[[0-9_]\\b]MOUSE[[0-9_]\\b]"
"[[0-9_]|\\b]MOUSE[[0-9_]|\\b]"
"[$|[^A-Z]]MOUSE[^|[^A-Z]]"
"[?<=^|[^A-Z]]MOUSE[?=$|[^A-Z]]"

None of them work.

I'm actually looking for several words (based on a long vector of values), so the final result should look something like

grep(paste0("\\b", paste(searchwords, collapse = "\\b|\\b"), "\\b"), targettext)

(with a different delimiter because \\b is too restrictive for me).

(This is a similar question to the one asked by user Nick Sabbe in a comment here: Using grep in R to find strings as whole words (but not strings as part of words))

Wiktor Stribiżew · Accepted Answer · 2016-11-25 10:21:04Z

Use PCRE regex with lookarounds:

grep("(?<![A-Z])MOUSE(?![A-Z])", targettext, perl=TRUE)

See the regex demo

The (?<![A-Z]) negative lookbehind will fail the match if the word is preceded with an uppercase ASCII letter and the negative lookahead (?![A-Z]) will fail the match if the word is followed with an uppercase ASCII letter.
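To see how this differs from \\b, here is a minimal sketch that runs both patterns on strings like those in the question. The vector x and the string "DOGMOUSECAT" are illustrative additions, not part of the original post.

```r
# Two strings from the question plus one illustrative negative case
x <- c("DOG MOUSE CAT", "DOG_MOUSE9CAT", "DOGMOUSECAT")

# \b treats digits and underscore as word characters,
# so MOUSE in "DOG_MOUSE9CAT" is not seen as a whole word
word_b <- grepl("\\bMOUSE\\b", x, perl = TRUE)

# The lookarounds only reject an adjacent uppercase ASCII letter
look <- grepl("(?<![A-Z])MOUSE(?![A-Z])", x, perl = TRUE)

print(word_b)  # TRUE FALSE FALSE
print(look)    # TRUE TRUE FALSE
```

The lookarounds are zero-width, so the neighbouring characters are checked without being consumed by the match.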

To apply the lookarounds to all the alternatives you have, use an outer grouping (?:...|...).

See the R online demo:

> targettext <- c("DOG MOUSE CAT","DOG MOUSE:CAT","DOG_MOUSE9CAT","MOUSE9CAT","DOG_MOUSE")
> searchwords <- c("MOUSE","FROG")
> grep(paste0("(?<![A-Z])(?:", paste(searchwords, collapse = "|"), ")(?![A-Z])"), targettext, perl=TRUE)
[1] 1 2 3 4 5
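For a long vector of search words, the pattern construction can be wrapped in a small function. The sketch below is one possible way to do it: make_pattern is a hypothetical helper name, the sample strings "DOGMOUSECAT" and "FROG7" are made up for illustration, and the words are assumed to contain no regex metacharacters.

```r
# Hypothetical helper: builds one lookaround-delimited pattern
# for a vector of search words (assumed free of regex metacharacters)
make_pattern <- function(words) {
  paste0("(?<![A-Z])(?:", paste(words, collapse = "|"), ")(?![A-Z])")
}

targettext <- c("DOG MOUSE CAT", "DOG_MOUSE9CAT", "DOGMOUSECAT", "FROG7")
pat <- make_pattern(c("MOUSE", "FROG"))

# Indices of the elements containing at least one whole-word match
hits <- grep(pat, targettext, perl = TRUE)

# The matched words themselves, one list element per input string
found <- regmatches(targettext, gregexpr(pat, targettext, perl = TRUE))

print(hits)  # 1 2 4
```

Here regmatches combined with gregexpr shows which of the search words matched in each string, which grep alone does not report.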

Fantastic! thank you so much. This actually returns the same result as \\b with my data (which is very long), but my confidence level has increased thanks to you :)
The \b meaning is context dependent, while the lookarounds provide a way to use unambiguous and customizable boundaries.

Abraham JA · Accepted Answer · 2023-09-14 17:25:09Z

Another way to do this is with the rflashtext library.

Build a KeywordProcessor object with the following parameters:

keys: The words you want to search for. In this case, c("DOG", "MOUSE", "CAT").
chars: The characters that count as part of a word, i.e. the opposite of the boundary. In your case, the uppercase letters: paste(LETTERS, collapse = "").

Use the find_keys function to search for the keys in each sentence. Set span_info to FALSE to retrieve only the words; use TRUE to retrieve the words together with the positions of the matches.

To get the same output as grep, combine which with lengths and unlist:

library(rflashtext)

processor <- KeywordProcessor$new(keys = c("DOG", "MOUSE", "CAT"), chars = paste(LETTERS, collapse = ""))

found <- processor$find_keys(sentences = c("DOG MOUSE CAT", "DOG MOUSE:CAT", "DOG_MOUSE9CAT", "MOUSE9CAT", "DOG_MOUSE"), span_info = FALSE)

found
[[1]]
[[1]]$word
[1] "DOG" "MOUSE" "CAT"

[[2]]
[[2]]$word
[1] "DOG" "MOUSE" "CAT"

[[3]]
[[3]]$word
[1] "DOG" "MOUSE" "CAT"

[[4]]
[[4]]$word
[1] "MOUSE" "CAT"

[[5]]
[[5]]$word
[1] "DOG" "MOUSE"

which(lengths(unlist(found, recursive = FALSE, use.names = FALSE)) > 0)
[1] 1 2 3 4 5
