0

I have the following list

    Acid
    stuff
    goo
    nasty
    Probable
    Acid
    more
    stuff
    Probable
    Acid
    fff
    ggg
    Probable

I want to match everything between Acid and Probable. However, my regex returns only the last match (Acid, fff, ggg, Probable), not the first (Acid, stuff, goo, nasty, Probable).

The calling class:

    public static void main(String[] args) throws IOException {
        PDFManager pdfManager = new PDFManager();
        pdfManager.setFilePath("MyFile.pdf");
        String s = pdfManager.ToText();
        if (s.contains("Thresholds")) {
            BravoaltDoc_ExtractionNonDays Sum = new BravoaltDoc_ExtractionNonDays(s);
            Sum.ExtractSumNew(s);
        }
    }

    public class BravoaltDoc_ExtractionNonDays {
        String doc;
        ArrayList<String> Day_arr = new ArrayList<String>();
        ArrayList<List<String>> Day_table2d = new ArrayList<List<String>>();
        String[] seTab3Landmarks = null;

        public BravoaltDoc_ExtractionNonDays(String doc) {
            this.doc = doc;
        }

        public String ExtractSumNew(String doc) {
            Pattern Tab3Landmarks_pattern = Pattern.compile("Acid?(.*?)Probable", Pattern.DOTALL);
            Matcher matcherTab3Landmarks_pattern = Tab3Landmarks_pattern.matcher(doc);
            while (matcherTab3Landmarks_pattern.find()) {
                doc = matcherTab3Landmarks_pattern.group(1);
                seTab3Landmarks = matcherTab3Landmarks_pattern.group(1).split("\\n|\\r");
            }
            for (String n : seTab3Landmarks) {
                System.out.println(n);
            }
            return docSlim;
        }
    }
  • How are you matching against the string? You only show the pattern. Commented May 5, 2016 at 19:46
  • There's nothing wrong with your pattern (except that you're making the 'd' in Acid optional, for some reason), so the problem probably comes when you're trying to use it. Show the code where you actually matched this compiled pattern. Commented May 5, 2016 at 19:48
  • It works. The question should be closed as off-topic if you do not provide non-working code with a description of what is wrong. Commented May 5, 2016 at 20:00
  • I have added the whole code. Commented May 5, 2016 at 20:44

2 Answers

2

Description

This regex will do the following:

  • Match the substrings that start with Acid and end with Probable
  • Require Acid and Probable to be on their own lines. If either is embedded in the middle of a string, like gooProbablegoo, it won't match

For this regex I used the case-insensitive flag and the dot-matches-newline flag.

(?:\r|\n|\A)\s*Acid\s*?[\r\n].*?[\r\n]\s*Probable\s*?(?:\r|\n|\Z) 
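As a rough sketch of how this pattern could be used from Java: the class name `AcidBlockDemo`, the helper `findBlocks`, and the `\r\n` line endings in the sample are my own assumptions for illustration, not part of the question's code. The two flags mentioned above map to `Pattern.CASE_INSENSITIVE` and `Pattern.DOTALL`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AcidBlockDemo {
    // The pattern from this answer, with backslashes doubled for a Java string literal.
    static final Pattern BLOCK = Pattern.compile(
        "(?:\\r|\\n|\\A)\\s*Acid\\s*?[\\r\\n].*?[\\r\\n]\\s*Probable\\s*?(?:\\r|\\n|\\Z)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: collects every block the pattern finds.
    // trim() drops the line breaks the pattern consumes at either end of a match.
    static List<String> findBlocks(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            blocks.add(m.group().trim());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Assumption: the text uses Windows-style \r\n line endings.
        String sample = String.join("\r\n",
            "Acid", "stuff", "gooProbablegoo", "nasty", "Probable",
            "Acid", "more", "stuff", "Probable",
            "Acid", "fff", "ggg", "Probable");
        for (String block : findBlocks(sample)) {
            // Collapse line breaks so each block prints on one line.
            System.out.println(block.replaceAll("\\s+", " "));
        }
    }
}
```

One thing worth checking against your own input: the pattern consumes a line break on both sides of each block. With `\r\n` endings the `\r` and `\n` are shared between neighbouring blocks, but with bare `\n` endings two back-to-back blocks have only one line break between them, so `find()` may skip the second of the pair.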


Example

Sample Text

Note: the difficult edge case in the third line.

    Acid
    stuff
    gooProbablegoo
    nasty
    Probable
    Acid
    more
    stuff
    Probable
    Acid
    fff
    ggg
    Probable

Matches

    [0][0] = Acid stuff gooProbablegoo nasty Probable
    [1][0] = Acid more stuff Probable
    [2][0] = Acid fff ggg Probable

Explained

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?:                      group, but do not capture:
    ----------------------------------------------------------------------
      \r                       '\r' (carriage return)
    ----------------------------------------------------------------------
     |                        OR
    ----------------------------------------------------------------------
      \n                       '\n' (newline)
    ----------------------------------------------------------------------
     |                        OR
    ----------------------------------------------------------------------
      \A                       the beginning of the string
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
    ----------------------------------------------------------------------
    Acid                     'Acid'
    ----------------------------------------------------------------------
    \s*?                     whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the least amount possible))
    ----------------------------------------------------------------------
    [\r\n]                   any character of: '\r' (carriage return), '\n' (newline)
    ----------------------------------------------------------------------
    .*?                      any character (0 or more times (matching the least amount possible))
    ----------------------------------------------------------------------
    [\r\n]                   any character of: '\r' (carriage return), '\n' (newline)
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
    ----------------------------------------------------------------------
    Probable                 'Probable'
    ----------------------------------------------------------------------
    \s*?                     whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the least amount possible))
    ----------------------------------------------------------------------
    (?:                      group, but do not capture:
    ----------------------------------------------------------------------
      \r                       '\r' (carriage return)
    ----------------------------------------------------------------------
     |                        OR
    ----------------------------------------------------------------------
      \n                       '\n' (newline)
    ----------------------------------------------------------------------
     |                        OR
    ----------------------------------------------------------------------
      \Z                       before an optional \n, and the end of the string
    ----------------------------------------------------------------------
    )                        end of grouping

4 Comments

Great explanation but the ability to only match once with the Pattern.compile is my preferred answer
For the explanation I used the match all flag in the regular expression. Which instance of the substring are you interested in getting back?
Ok, if you're just looking for the first instance of the substring, then you'd use the regex without iterating through all the matches option. See link to a live java example to see how it would look.
Cool. Thanks so much
1

Your code correctly finds all the matches. However, since each find re-assigns seTab3Landmarks, you only get the last match printed out at the end.

If you only want the first match, use an "if" block instead of a "while" block. The "while" loop calls find() until no matches remain, whereas "if" calls it exactly once.
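To make the difference concrete, here is a minimal sketch. The class `FirstMatchDemo` and its two methods are made up for illustration, and I dropped the `?` after `Acid` since the optional 'd' does not seem intended; otherwise the pattern is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstMatchDemo {
    // Question's pattern without the '?' after "Acid" (assumed to be unintended).
    static final Pattern SECTION = Pattern.compile("Acid(.*?)Probable", Pattern.DOTALL);

    // Mirrors the question's loop: every find() overwrites the previous result,
    // so only the last match survives.
    static String lastMatch(String doc) {
        Matcher m = SECTION.matcher(doc);
        String result = null;
        while (m.find()) {
            result = m.group(1);
        }
        return result;
    }

    // A single find() call stops at the first match.
    static String firstMatch(String doc) {
        Matcher m = SECTION.matcher(doc);
        if (m.find()) {
            return m.group(1);
        }
        return null; // no match at all
    }

    public static void main(String[] args) {
        String doc = "Acid\nstuff\ngoo\nnasty\nProbable\n"
                   + "Acid\nmore\nstuff\nProbable\n"
                   + "Acid\nfff\nggg\nProbable\n";
        // trim() removes the line breaks captured right after Acid and before Probable.
        System.out.println(firstMatch(doc).trim()); // stuff, goo, nasty on separate lines
        System.out.println(lastMatch(doc).trim());  // fff, ggg on separate lines
    }
}
```

Returning `null` when nothing matches also avoids the NullPointerException the original `for` loop would hit if `seTab3Landmarks` were never assigned.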

5 Comments

Yes, but the idea is not to find all the matches, just the first one, so reassigning shouldn't really be an issue.
@SebastianZeki - that doesn't make sense. If you only want to find the first one, why do you loop through all the matches? And yes, re-assigning is a problem because it means you end up with the last one, not the first one.
OK. So I tried converting the 'while' to an 'if' but I think it still iterates multiple times. And if I get rid of the 'while { }' altogether I get an error. How do I go through and match only once?
that still gives me the last match
My apologies. You were correct. Please post as an answer. To find the first match all I needed to do was match once. Good learning point
