0

I have the below regex expression in a Java code, it is taking a good deal of time to complete on some cases. Is there a way to improve it?

String decimal = "([0-9]+(\\.[0-9]+)?[/-]?)+"; String units = "(in|ft)\\.?"; String unitName = "(cu\\.? *ft|gauge|watt|rpm|ft|lbs|K|GPF|btu|mph|cfm|volt|oz|pounds|dbi|miles|amp|hour|kw|f|degrees|year)"; sizePattern.add(Pattern.compile("(?i)" + decimal + " *" + units + " *x? *" + decimal + " *" + units + " *x? *" + decimal + " *" + units + "")); sizePattern.add(Pattern.compile("(?i)" + decimal + " *" + units + " *x? *" + decimal + " *" + units)); sizePattern.add(Pattern.compile("(?i)" + decimal + " *x *" + decimal + " *" + units)); sizePattern.add(Pattern.compile("(?i)" + decimal + "( *" + units + ")")); sizePattern.add(Pattern.compile("(?i)" + decimal + "( *sq?\\.?)( *ft?\\.?)")); sizePattern.add(Pattern.compile("(?i)" + decimal + " *" + unitName)); sizePattern.add(Pattern.compile("(?i)" + decimal + "(d)")); sizePattern.add(Pattern.compile("(?i)" + decimal + "( *(%|percent))")); sizePattern.add(Pattern.compile("(?i)" + decimal)); for (Pattern p : sizePattern) { ODebug.Write(Level.FINER, "PRD-0079: Using pattern = " + p.pattern()); m = p.matcher(_data); while (m.find()) { ODebug.Write(Level.FINER, " Got => [" + m.group(0) + "]"); this.Dimensions.add(m.group(0)); _data = _data.replaceAll("\\Q" + m.group(0) + "\\E", "."); m = p.matcher(_data); } } 

String causing the issue:

Micro-Induction Cooktop provides the best in cooktop performance, safety and efficiency. Induction heats as electricity flows through a coil to produce a magnetic field under the ceramic plate. When a ferromagnetic cookware is placed on the ceramic surface, currents are induced in the cookware and instant heat is generated due to the resistance of the pan. Heat is generated to the pan only and no heat is lost. As there are no open flames, inductions are safer to use than conventional burners. Once cookware is removed, all molecular activity ceases and heating is stopped immediately.Flush surface for built-in or freestanding applicationDual functions: Cook and Warm7 power settings (100-300-500-700-900-1100-1300W)* The 2 lowest power settings cannot be actually achieved, but are ""simulated"":100W = 500W intermittently heat for 2 seconds and stop for 8 seconds300W = 500W intermittently heat for 6 seconds and stop for 4 seconds13 Keep Warm settings (100-120-140-160-180-190-210-230-250-280-300-350-390F)Touch sensitive panel with control lockUp to 8 hours timerMicro-crystal ceramic plateAutomatic pan detectionLED panelETL/ETL-Sanitation/FCC certified for household or commercial useHome Depot Protection Plan:

4
  • 7
    Can you post the complete expression that takes long without having us to concatenate it ourselves? And if possible some sample input as well? Finally please define "a good deal of time". Commented Feb 5, 2016 at 9:38
  • And if possible, a shorter version of it. Commented Feb 5, 2016 at 9:39
  • My suggestion is that rather than testing for 9 patterns seperately in a loop, you combine them with an alternation operator so that the _data is only scanned once. You need to mention that "sizePattern" is a java List type and you need to describe the _data and give some example data Commented Feb 5, 2016 at 10:23
  • You string is pretty trivial to parse, so it must be a catastrophic backtracking. Commented Feb 6, 2016 at 12:55

1 Answer 1

1

Assuming your _data is long, it's not the matching what takes the time, but rather the assignment

_data = _data.replaceAll("\\Q" + m.group(0) + "\\E", "."); 

which is O(n**2) in terms of the string length. Just don't do it.

You could do it simpler with

_data = _data.replace(m.group(0), "."); 

but just don't do it. Do you need a reduced _data at the end? If so, use a single replaceAll per pattern (it uses a StringBuffer internally and is only linear in the size of the string).

Additionally:

  • Use non-capturing groups.
  • Consider recycling the Matcher by using reset(CharSequence) and usePattern(Pattern).
  • Consider combining all the patterns into one. As all of them start the same, it could be quite efficient.
  • Your decimal can probably get slow in case there's no match. Leaving out the optional part, you get "([0-9]+)+" which can backtrack needlessly a lot. Consider using atomic groups.
Sign up to request clarification or add additional context in comments.

2 Comments

Many thanks for you response! I am trying your suggestions but they dont seem to work: I need to capture the whole match, so non-capturing groups will not work for me. As to combining all the patterns, the first pattern is taking a very long time by itself, so the issue is probably not in the number of patterns applied. I have posted the string thats causing the trouble in the question section if you want to have a look. There are no matches for the first pattern so I assume catastrophic backtracking might be an issue? How can I use atomic groups to enhance the patterns' performance?
@GeorgesLteif "I need to capture the whole match, so non-capturing groups will not work for me." - group(0) is always the whole match, no matter what. The capturing groups allow you to access group(1), etc., which you don't need. +++ Use String decimal = "([0-9]++(\\.[0-9]++)?[/-]?)+";, It's called "possessive quantifier" and it's just a shortcut for an atomic group around +. +++ You might also need things like String units = "(?>in|ft)\\.?";. In java, it's called "independent group" (but atomic seems to be the usual term.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.