How can I get the text between two constant strings?
Example:
<rate curr="KRW" unit="100">19,94</rate>
19,94 is between
"<rate curr="KRW" unit="100">" and "</rate>".
Other example:
ABCDEF: the substring between AB and EF is CD.
If you're analyzing HTML, you're probably better off using JavaScript and the .innerHTML property. Regex is a bit overkill.
If you want a generic solution, i.e. to find a string between two strings, you can use Pattern.quote() (or wrap each string in \Q and \E) to quote the start and end strings, and use (.*?) for a non-greedy match.
See an example of its use in the snippet below:
// Needed at the top of the test class:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Test
public void quoteText() {
    String str1 = "<rate curr=\"KRW\" unit=\"100\">";
    String str2 = "</rate>";
    String input = "<rate curr=\"KRW\" unit=\"100\">19,94</rate>"
            + "<rate curr=\"KRW\" unit=\"100\"></rate>"
            + "<rate curr=\"KRW\" unit=\"100\">19,96</rate>";
    // Quote both delimiters so they match literally; (.*?) captures lazily.
    String regex = Pattern.quote(str1) + "(.*?)" + Pattern.quote(str2);
    System.out.println("regex:" + regex);
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    while (m.find()) {
        String group = m.group(1);
        System.out.println("--" + group);
    }
}

Output:
regex:\Q<rate curr="KRW" unit="100">\E(.*?)\Q</rate>\E
--19,94
--
--19,96

Note: Though it's not recommended to use regex to parse entire HTML, I think there is no harm in conscious use of regex while treating HTML as plain text.
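For reuse, the same Pattern.quote() idea can be wrapped in a small helper. This is only a sketch: the class name Between and the method between are made up for illustration, and the Pattern.DOTALL flag is an added assumption (it lets the captured text span line breaks, which the snippet above does not do).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Between {
    // Hypothetical helper: collects every substring found between start and end.
    static List<String> between(String input, String start, String end) {
        // Pattern.quote makes both delimiters literal; (.*?) is the lazy capture.
        // DOTALL is an assumption so that '.' also matches newlines.
        Pattern p = Pattern.compile(
                Pattern.quote(start) + "(.*?)" + Pattern.quote(end),
                Pattern.DOTALL);
        Matcher m = p.matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // The two examples from the question.
        System.out.println(between("ABCDEF", "AB", "EF")); // prints [CD]
        System.out.println(between(
                "<rate curr=\"KRW\" unit=\"100\">19,94</rate>",
                "<rate curr=\"KRW\" unit=\"100\">",
                "</rate>")); // prints [19,94]
    }
}
```

Returning a list rather than a single string keeps the empty match (the middle "--" line in the output above) visible to the caller.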
The simple regex matching string you're looking for is:
(?<=<rate curr=\"KRW\" unit=\"100\">)(.*?)(?=</rate>)
In Ruby, for example, this would translate to:
string = '<rate curr="KRW" unit="100">19,94</rate>'
string.match("(?<=<rate curr=\"KRW\" unit=\"100\">)(.*?)(?=</rate>)").to_s # => "19,94"
Thanks to Will Yu.
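Since the question's other answers use Java, here is a rough Java equivalent of the lookbehind/lookahead pattern. It is a sketch, not a drop-in solution: the class name LookaroundDemo and the method firstBetween are hypothetical. Because the delimiters sit inside lookarounds, the whole match (group()) is already the text in between, so no capture group index is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // Lookbehind asserts the opening tag precedes the match;
    // lookahead asserts the closing tag follows it.
    private static final Pattern RATE = Pattern.compile(
            "(?<=<rate curr=\"KRW\" unit=\"100\">)(.*?)(?=</rate>)");

    // Hypothetical helper: returns the first match, or null if there is none.
    static String firstBetween(String input) {
        Matcher m = RATE.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "<rate curr=\"KRW\" unit=\"100\">19,94</rate>";
        System.out.println(firstBetween(s)); // prints 19,94
    }
}
```

Java requires lookbehind patterns to have a bounded length, which a constant opening tag like this one satisfies.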
I suggest that you use an HTML parser. The grammar that defines HTML is context-free, which makes it fundamentally too complex to be parsed by regular expressions. Even if you manage to write a regular expression that achieves what you want, it will probably fail on some corner cases.
For instance, what if you are expected to parse the following HTML?
<rate curr="KRW" unit="100"><rate curr="KRW" unit="100">19,94</rate></rate>
A regular expression may not handle this corner case properly.
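To make the corner case concrete, the sketch below runs a non-greedy regex and the JDK's built-in XML DOM parser over the nested input. The class and method names (NestedRateDemo, viaRegex, viaParser) are made up for illustration, and using an XML parser is an assumption that only works because this fragment happens to be well-formed XML; real-world HTML would need a dedicated HTML parser.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NestedRateDemo {
    static final String OPEN = "<rate curr=\"KRW\" unit=\"100\">";
    static final String CLOSE = "</rate>";

    // Non-greedy regex: stops at the first closing tag it sees.
    static String viaRegex(String input) {
        Matcher m = Pattern
                .compile(Pattern.quote(OPEN) + "(.*?)" + Pattern.quote(CLOSE))
                .matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // DOM parser: understands nesting, so the text content is just the number.
    static String viaParser(String input) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(input)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String nested = OPEN + OPEN + "19,94" + CLOSE + CLOSE;
        System.out.println(viaRegex(nested));  // prints <rate curr="KRW" unit="100">19,94
        System.out.println(viaParser(nested)); // prints 19,94
    }
}
```

The regex result still contains the inner opening tag, which is the kind of failure described above.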