Split character vector at math comparisons signs in R

Question

I would like to split expressions containing mathematical comparisons, e.g.

unlist(strsplit("var<3", "(?=[=<>])", perl = TRUE))
unlist(strsplit("var==5", "(?=[=<>])", perl = TRUE))
unlist(strsplit("var>2", "(?=[=<>])", perl = TRUE))

The results are:

[1] "var" "<" "3"
[1] "var" "=" "=" "5"
[1] "var" ">" "2"

For the 2nd example above, I would like to get [1] "var" "==" "5", so the two = should be returned as a single element. How do I need to change my regular expression to achieve this? (I already tried grouping and quantifiers for "==", but nothing worked - regular expressions are not my friends...)

@Wiktor, yes, I only want to limit the splits at >, < and ==. Maybe also !=.
– Daniel, Commented Nov 23, 2016 at 8:32
Btw, you can capture each part using sub("(.*?)([=<>].)(.*)", "\\2", "var==55", perl = TRUE) or something similar. You can also use it for splitting: strsplit(sub("(.*?)([=<>].)(.*)", "\\1 \\2 \\3", "var==55", perl = TRUE), " "), but Wiktor's solution is probably better.
– David Arenburg, Commented Nov 23, 2016 at 8:42
If convenient for your use case, there is always the parsing option: lapply(unlist(lapply(c("var<3", "var==5", "var>2"), function(e) parse(text = e))), sapply, deparse)
– alexis_laz, Commented Nov 23, 2016 at 8:45
@DavidArenburg, your example above with strsplit produces [1] "var" ">5" "5" when I use "var>55" as x in sub().
– Daniel, Commented Nov 23, 2016 at 9:11
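
The two comment suggestions above can be sketched as follows. This is a minimal sketch, not the commenters' exact code: the helper names split_by_sub and split_by_parse are hypothetical, and the sub() pattern swaps ([=<>].) for an explicit operator alternation, (==|!=|[<>]), so that "var>55" no longer loses a digit to the operator group. The sub() variant assumes the input itself contains no spaces.

```r
# Variant of the sub() idea: list the operators explicitly instead of "[=<>]."
# so that a one-character operator does not swallow the next character.
split_by_sub <- function(x) {
  spaced <- sub("(.*?)(==|!=|[<>])(.*)", "\\1 \\2 \\3", x, perl = TRUE)
  strsplit(spaced, " ", fixed = TRUE)
}

# Parsing option: let R's own parser find the operator and both operands.
# A parsed comparison is a call of the form `op`(lhs, rhs).
split_by_parse <- function(x) {
  lapply(x, function(e) {
    call <- parse(text = e)[[1]]
    parts <- vapply(as.list(call),
                    function(p) paste(deparse(p), collapse = ""),
                    character(1))
    parts[c(2, 1, 3)]  # reorder to lhs, operator, rhs
  })
}

inputs <- c("var<3", "var==5", "var>55", "var!=2")
by_sub <- split_by_sub(inputs)
by_parse <- split_by_parse(inputs)
print(by_sub)
print(by_parse)
```

The parsing route only works for strings that are valid R expressions, and it deparses the operands, so a literal such as 5.0 may come back in a normalized form.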

Wiktor Stribiżew · Accepted Answer · 2016-11-23 08:29:01Z

9

You may use a PCRE regex to match the substrings you need:

==|[<>]|(?:(?!==)[^<>])+

To also support !=, modify it as

[!=]=|[<>]|(?:(?![=!]=)[^<>])+

See the regex demo.

Details:

== - 2 = signs
| - or
[<>] - a < or >
| - or
(?:(?!==)[^<>])+ - 1 or more chars other than < and > ([^<>]) that do not start a == char sequence (a tempered greedy token).

NOTE: This is easily expandable by adding more alternatives and adjusting the tempered greedy token.
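
As a hedged illustration of the note above, the sketch below applies the != variant to the inputs from the question and then adds one more alternative so that <= and >= also come back as single tokens. The <= / >= extension and the helper name tokenize are assumptions added for illustration, not part of the original answer.

```r
# Return all tokens matched by a PCRE pattern, one character vector per input.
tokenize <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

# Pattern from the answer, with != support.
pat_ne <- "[!=]=|[<>]|(?:(?![=!]=)[^<>])+"

# Assumed extension: let < and > join the first alternative so that
# "<=" and ">=" are tried before a bare "<" or ">".
pat_all <- "[!=<>]=|[<>]|(?:(?![=!]=)[^<>])+"

basic <- tokenize(c("var<3", "var==5", "var>2", "var!=7"), pat_ne)
extended <- tokenize(c("var<=3", "var>=10"), pat_all)
print(basic)
print(extended)
```

With pat_ne alone, "var>=10" would be tokenized as "var", ">" and "=10", which is why the extra alternative is needed for those operators.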

R test:

> text <- "Text1==text2<text3><More here"
> res <- regmatches(text, gregexpr("==|[<>]|(?:(?!==)[^<>])+", text, perl=TRUE))
> res
[[1]]
[1] "Text1"     "=="        "text2"     "<"         "text3"     ">"
[7] "<"         "More here"

answered Nov 23, 2016 at 8:29

Wiktor Stribiżew


5 Comments

Tensibai Over a year ago

I went with ([a-zA-Z0-9_]+)([^a-zA-Z0-9_]+)([a-zA-Z0-9_]+) as the regex, since operator characters should not appear on either side of the operator.

Wiktor Stribiżew Over a year ago

@Tensibai: do you mean you had to check if there are word chars on both sides of those operators? You could just use "\\b(?:[!=]=|[<>])\\b"

Tensibai Over a year ago

For regmatches to return three capture groups, I think it's better to specify them explicitly. Using a Perl-style class such as \w would also be better, but I feel it's easier to understand like this, i.e. regmatches(tests,regexec("([a-zA-Z0-9_]+)([^a-zA-Z0-9_]+)([a-zA-Z0-9_]+)",tests)), where tests is a vector, would give the initial string with each part in its own capture group. (That's just an alternative.)

Wiktor Stribiżew Over a year ago

I think I understand now why you suggest this: OP sample input strings are short 2 part operator-separated strings. In this case, I generalized the problem to find any known operators and the rest.

Tensibai Over a year ago

Oh, yes, I kept it simple :)

Tensibai · Answer · 2016-11-23 13:09:40Z

Expanding from my idea in comments, just for the formatting:

tests=c("var==5","var<3","var.name>5")
regmatches(tests,regexec("([a-zA-Z0-9_.]+)(\\W+)([a-zA-Z0-9_.]+)",tests))

\w is [a-zA-Z0-9_] and \W is its opposite, [^a-zA-Z0-9_]. After a comment, I expanded the operand groups to include . in the character class, and spelled the class out explicitly because R's base regex engine doesn't support \w inside a character class (that would need perl=TRUE).

So the regex searches for at least one character from \w or ., then at least one character not in \w (to match the operator), and then at least one character from \w or . again.

Each part is captured, and this gives:

[[1]]
[1] "var==5" "var"    "=="     "5"

[[2]]
[1] "var<3" "var"   "<"     "3"

[[3]]
[1] "var.name>5" "var.name"   ">"          "5"

You may add " *" (a space followed by *) between the capture groups if your entries could have spaces around the operator; otherwise the operator capture will include them.
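
A minimal sketch of the optional-space variant mentioned above. It assumes plain spaces (not tabs) around the operator, and it narrows the operator group from \\W+ to a class that also excludes spaces and dots, so that the spaces are not captured together with the operator. Both choices are assumptions for illustration, not the answer's exact regex.

```r
tests <- c("var == 5", "var<3", "var.name > 5")

# Operand: letters, digits, underscore, dot.
# Operator: one or more characters that are neither operand characters nor spaces.
pattern <- "([a-zA-Z0-9_.]+) *([^a-zA-Z0-9_. ]+) *([a-zA-Z0-9_.]+)"

matches <- regmatches(tests, regexec(pattern, tests))

# Drop the full match (first element) and keep only the three captured parts.
parts <- lapply(matches, function(m) m[-1])
print(parts)
```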

Nice one (I take it you don't mind if I use your idea to simplify my regex) ;-p
Thanks, nice and short solution. However, this one does not work with variable names with dots, e.g. regmatches("var.name==5",regexec("(\\w+)(\\W+)(\\w+)","var.name==5")). I tried something like regmatches("var.name==5",regexec("(\\w|[.]+)(\\W+)(\\w+)","var.name==5")), but that one eats the "==" part of the character vector.
@Daniel Corrected, just use a character class to add . to the allowed chars.
@Daniel This: (\\w|[.]+) means any char from \w once OR a dot at least one time. Maybe (\\w|[.])+ would have worked better (for matching; the capture would be awful), but with backtracking you'll probably have a problem.

Cath · Answer · 2016-11-23 09:32:15Z

Using word boundaries (\\b) and specifying two possibilities for the lookaround:

unlist(strsplit("var==5", "(?=(\\b[^a-zA-Z0-9])|(\\b[a-zA-Z0-9]\\b))", perl = TRUE))
[1] "var" "==" "5"
unlist(strsplit("var<3", "(?=(\\b[^a-zA-Z0-9])|(\\b[a-zA-Z0-9]\\b))", perl = TRUE))
[1] "var" "<" "3"
unlist(strsplit("var>2", "(?=(\\b[^a-zA-Z0-9])|(\\b[a-zA-Z0-9]\\b))", perl = TRUE))
[1] "var" ">" "2"

Explanation:

Split at a word boundary that is followed either by a non-alphanumeric character (\\b[^a-zA-Z0-9]) or by a single alphanumeric character that itself ends a "word" (\\b[a-zA-Z0-9]\\b).

EDIT:

Actually, the above code gives unexpected results if the number at the end is 10 or more, i.e. has two or more digits.
Another option is to use a lookbehind and split where the preceding character is either a non-alphanumeric character followed by a word boundary, or an alphanumeric character followed by a word boundary:

strsplit("var<20", "(?<=(([^a-zA-Z0-9]\\b)|([a-zA-Z0-9]\\b)))", perl = TRUE)[[1]]
#[1] "var" "<" "20"
strsplit("var==20", "(?<=(([^a-zA-Z0-9]\\b)|([a-zA-Z0-9]\\b)))", perl = TRUE)[[1]]
#[1] "var" "==" "20"
strsplit("var!=5", "(?<=(([^a-zA-Z0-9]\\b)|([a-zA-Z0-9]\\b)))", perl = TRUE)[[1]]
#[1] "var" "!=" "5"

EDIT2:

Totally stealing @Tensibai's way of defining alphanumeric (+ underscore) / non-alphanumeric characters, the above regex can be simplified to: "(?<=((\\W\\b)|(\\w\\b)))"
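
The simplified lookbehind can be tried on a few inputs as below. This is only a sketch, and the helper name split_ops is hypothetical. Because . is not a \\w character, a name such as var.name would also be split at the dot, so this version assumes operand names made of letters, digits and underscores.

```r
# Split after any character that is immediately followed by a word boundary,
# i.e. at every transition between word and non-word characters.
split_ops <- function(x) {
  strsplit(x, "(?<=((\\W\\b)|(\\w\\b)))", perl = TRUE)
}

res <- split_ops(c("var<20", "var==20", "var!=5", "var_name>=10"))
print(res)
```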
