0

I am trying to extract both the tag and the text between the tags from a text file. I want to do this with a regex, since the file contains only a few XML tags.

Below is what I have tried so far:

    String txt="<DATE>December</DATE>";

    String re1="(<[^>]+>)"; // Tag 1
    String re2="(.*?)"; // Variable Name 1
    String re3="(<[^>]+>)"; // Tag 2

    Pattern p = Pattern.compile(re1+re2+re3,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = p.matcher(txt);
    if (m.find())
    {
        String tag1=m.group(1);
        String var1=m.group(2);
        String tag2=m.group(3);
        //System.out.print("("+tag1.toString()+")"+"("+var1.toString()+")"+"("+tag2.toString()+")"+"\n");
        System.out.println(tag1.toString().replaceAll("<>", ""));
        System.out.println(var1.toString());
    }

As an answer, I get:

    <DATE>
    December

How do I get rid of the <>?

2 Answers

2

Don't use regex to parse markup syntax, such as XML, HTML, XHTML and so on.

Many reasons are shown here.

Instead, do yourself a favor and use XPath and XQuery.
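As a rough illustration of the XPath route, here is a minimal sketch using the `javax.xml.xpath` API that ships with the JDK. It assumes the input is a well-formed XML document with a single root element, which free text with stray tags would not be. The class name `XPathSketch` and the helper `tagAndText` are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathSketch {
    // Hypothetical helper: returns { root tag name, root text content }.
    // Assumes xml is a well-formed document; anything else is rejected.
    static String[] tagAndText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            String tag = xpath.evaluate("name(/*)", doc); // name of the root element
            String text = xpath.evaluate("/*", doc);      // string value of the root element
            return new String[] { tag, text };
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String[] r = tagAndText("<DATE>December</DATE>");
        System.out.println(r[0]);
        System.out.println(r[1]);
    }
}
```

No angle brackets need to be stripped here, because the parser hands back the element name and its text separately.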


1 Comment

Yes, you're right. But I only have a few tags in my text file (10 at most), hence the regex.
1

It is a bad idea to use regex to parse XML. With a regex there is no general way of identifying a complete element from its opening tag to its closing tag, because a regex cannot "remember" how many occurrences of a tag it has already seen, which is what nested elements require.

However, here is why your regex fails in this specific case:

In re1 and re3 the capturing groups include the < and > (and re3 does not account for the / of the closing tag). You could simply change them to:
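To make the nesting problem concrete, here is a small sketch. The input string, the class name `NestingSketch`, and the helper `lazyInner` are invented for illustration. A lazy match stops at the first closing tag it meets, so it pairs the outer opening tag with the inner closing tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestingSketch {
    // Hypothetical helper: content captured by a lazy match between <a> and </a>.
    static String lazyInner(String txt) {
        Matcher m = Pattern.compile("<a>(.*?)</a>").matcher(txt);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The outer element really contains "<a>x</a>y",
        // but the match ends at the first "</a>".
        System.out.println(lazyInner("<a><a>x</a>y</a>")); // prints "<a>x"
    }
}
```

A greedy `.*` fails the other way round: it runs to the last closing tag, so two sibling elements get merged into one match.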

    String re1="<([^>]+)>"; // Tag 1
    String re2="([^<]*)"; // Variable Name 1
    String re3="</([^>]+)>"; // Tag 2

or use a suitable regex to remove < and > from tag1:

System.out.println(tag1.toString().replaceAll("<|>", "")); 

or

System.out.println(tag1.toString().replaceAll("[<>]", "")); 
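Put together, the first variant (moving the brackets outside the capturing groups) might look like the following sketch. The class name `TagTextSketch` and the helper `firstTagAndText` are assumptions made for this example, not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextSketch {
    // re1 + re2 + re3 from above: the brackets sit outside the groups,
    // so group 1 is the bare tag name and group 2 the text in between.
    private static final Pattern ELEMENT =
            Pattern.compile("<([^>]+)>" + "([^<]*)" + "</([^>]+)>");

    // Hypothetical helper: returns { tag, text } for the first match, or null.
    static String[] firstTagAndText(String txt) {
        Matcher m = ELEMENT.matcher(txt);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] r = firstTagAndText("<DATE>December</DATE>");
        System.out.println(r[0]); // prints "DATE"
        System.out.println(r[1]); // prints "December"
    }
}
```

With the brackets outside the groups, no `replaceAll` call is needed afterwards.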

2 Comments

It works, but it does not recognize any further tags in the sentence. E.g. for: American Airlines made <TRIPS> 100 <TRIPS> flights in <DATE> December </DATE> it only recognizes TRIPS and 100, but not the next tag.
@Betafish: <TRIPS> is not closed by a </TRIPS> tag in your example. If you want to ignore that, you could use re3 = "</?([^>]+)>" or re3 = re1.
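For the sentence from the comment, a sketch along these lines could collect every tag/text pair. It uses the lenient `re3` suggested above and, as an additional assumption beyond what the comments say, replaces the single `if (m.find())` with a `while` loop so that matching continues after the first hit. The class name `AllTagsSketch` and the helper `extract` are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllTagsSketch {
    // Lenient closing tag: "</?" accepts both </TRIPS> and a repeated <TRIPS>.
    private static final Pattern ELEMENT =
            Pattern.compile("<([^>]+)>([^<]*)</?([^>]+)>");

    // Hypothetical helper: returns "TAG=text" for every match in txt.
    static List<String> extract(String txt) {
        List<String> out = new ArrayList<>();
        Matcher m = ELEMENT.matcher(txt);
        while (m.find()) { // keep searching after each match
            out.add(m.group(1) + "=" + m.group(2).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String txt = "American Airlines made <TRIPS> 100 <TRIPS> flights in "
                + "<DATE> December </DATE>";
        System.out.println(extract(txt)); // prints "[TRIPS=100, DATE=December]"
    }
}
```

Note that this pattern does not check that the closing tag name matches the opening one; it only pairs each tag with whatever tag follows the text.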
