0

Consider a multiline string consisting of N lines, like the following:

Line 1 text
Line 2 text
Line 3 text
...
Line n-1 text
Line n text
anchor=value
Line n+2 text
Line n+3 text
Line n+4 text
...
Line N text

The anchor key does not appear inside any of the lines and there may be spaces before the anchor, as well as around the = sign that follows it.

I need a regex that partitions the above string into 3 groups:

  1. Line 1 to Line n (inclusive)
  2. Anchor line (partition point)
  3. Line n+2 to Line N (inclusive)

The closest I have got to the solution is

(?s)^(?:(?!anchor\s*=\s*).)+?\r|\nanchor\s*=\s*([^\r\n]+)(?:\r|\n)(.*) 

but the above regex includes the entire text in the first matching group and populates the remaining 2 groups as expected.

An additional requirement is that the regex has to be as fast as possible, since it will be applied to large amounts of data. Note also that processing via a single regex is the only option in this use case.

Any ideas?

3 Answers

2

What about this regex?

(?s)^(.*?)(anchor\s*\=\s*[^\r\n]+)(.*?)

Or, to match the end of string,

(?s)^(.*?)(anchor\s*\=\s*[^\r\n]+)(.*?)$?


3 Comments

The first works if the non-greedy operator in the last group becomes greedy. The second works as is. Thanks!
@PNS, you are welcome. I knew the (.*?) pattern might need some boundary to hit, so I added the second option.
Yup, they seem to be fine. Thanks.
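For reference, here is a minimal Java sketch of the first regex with the last group made greedy, as suggested in the comment above, applied through matches(). The class name AnchorPartition and the helper partition are made up for illustration, and the sketch assumes the library calls matches() on the whole block, which is not confirmed in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorPartition {
    // First regex from the answer, with the last group made greedy.
    // (?s) lets '.' match line terminators, so group 1 can span many lines.
    private static final Pattern SPLIT = Pattern.compile(
            "(?s)^(.*?)(anchor\\s*=\\s*[^\\r\\n]+)(.*)");

    /** Hypothetical helper: returns {before, anchorLine, after}, or null if there is no anchor. */
    static String[] partition(String text) {
        Matcher m = SPLIT.matcher(text);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String text = "Line 1\nLine 2\n  anchor = value\nLine 4\nLine 5";
        String[] parts = partition(text);
        // Group 1 keeps the spaces before the anchor; group 3 starts with the
        // line break that follows the anchor line.
        System.out.println("[" + parts[0] + "]");
        System.out.println("[" + parts[1] + "]");
        System.out.println("[" + parts[2] + "]");
    }
}
```

Note that group 1 ends with whatever whitespace precedes the anchor and group 3 begins with the line break after the anchor line, so the caller may want to trim both.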
1

If you need speed on huge strings, regex is not the way to go. You have to have the entire string in memory to be able to use a regex to tokenize it. Using a Reader / InputStream instead would be my recommendation.

5 Comments

Sure, but in this use case the output comes from a library that only allows customization via a regular expression.
Now I am even more confused. If the output comes from a library, why are you doing the splitting? Do you mean the string is returned by a library?
It is processed by a library that allows "injection" of regular expressions.
@PNS "library that only allows customization via a regular expression" this information should be placed in your question. Also describe how exactly this regex will be applied by your library: will it be argument for split or maybe for find()?
No split, it is applied to blocks of text like the one in the example, so we are probably looking for matches().
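Whether the library uses find() or matches() matters for a regex whose last group is lazy. The sketch below is only an illustration of that difference, using the first regex from the top answer unchanged. The class and method names are hypothetical, and which call the library really makes is still an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyTailDemo {
    // First regex from the top answer, unchanged: the last group is lazy.
    static final Pattern LAZY = Pattern.compile(
            "(?s)^(.*?)(anchor\\s*=\\s*[^\\r\\n]+)(.*?)");

    // find() accepts the shortest overall match, so the lazy tail stays empty.
    static String tailWithFind(String text) {
        Matcher m = LAZY.matcher(text);
        return m.find() ? m.group(3) : null;
    }

    // matches() must consume the whole input, so the lazy tail is forced
    // to expand to the end of the text.
    static String tailWithMatches(String text) {
        Matcher m = LAZY.matcher(text);
        return m.matches() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        String text = "Line 1\nanchor = value\nLine 3";
        System.out.println("find:    [" + tailWithFind(text) + "]");
        System.out.println("matches: [" + tailWithMatches(text) + "]");
    }
}
```

Under find() the third group comes back empty, while under matches() it holds everything after the anchor line. Making the last group greedy gives the same result with either call.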
1

Well, you could first get the anchor, then split on it:

String anchor = str.replaceAll("(?ms).*?(anchor\\s*=.*?)$.*", "$1");
String[] lineParts = str.split("\\Q" + anchor + "\\E");

The "m" flag makes ^ and $ match start/end of lines.
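To see what the "m" flag changes, here is a small sketch that looks for the anchor line with and without Pattern.MULTILINE. The class MultilineFlagDemo and its method are invented for this example, and the pattern is a simplified one rather than the regex from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineFlagDemo {
    // Optional leading spaces or tabs, the key, '=', then the rest of the line.
    // '.' does not match line terminators here because DOTALL is not set.
    private static final String ANCHOR_LINE = "^[ \\t]*anchor\\s*=.*$";

    /** Returns the anchor line, or null when the pattern does not match. */
    static String anchorLine(String text, boolean multiline) {
        int flags = multiline ? Pattern.MULTILINE : 0;
        Matcher m = Pattern.compile(ANCHOR_LINE, flags).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "Line 1\n  anchor = value\nLine 3";
        // With MULTILINE, '^' and '$' also match at line boundaries.
        System.out.println(anchorLine(text, true));
        // Without it, '^' only matches at the very start of the input.
        System.out.println(anchorLine(text, false));
    }
}
```

With the flag set, the anchor line in the middle of the text is found. Without it, ^ can only match at the start of the input, so nothing is found for this sample.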

1 Comment

Thanks but what is needed here is a single regex that does it all, because the code does not allow anything else. +1 anyway. :-)
