I am trying to extract the text between particular tags and attributes. For now, I have only tried tags. I read a ".gexf" file, which contains XML data, and store its contents as a string. Then I try to extract the text between the "nodes" tags. Here is my code so far:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    private static String filePath = "src/babel.gexf";

    public String readFile(String filePath) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(filePath));
        try {
            StringBuilder sb = new StringBuilder();
            String line = br.readLine();
            while (line != null) {
                sb.append(line);
                sb.append("\n");
                line = br.readLine();
            }
            return sb.toString();
        } finally {
            br.close();
        }
    }

    public void getNodesContent(String content) throws IOException {
        final Pattern pattern = Pattern.compile("<nodes>(\\w+)</nodes>", Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }

    public static void main(String[] args) throws IOException {
        Main m = new Main();
        String result = m.readFile(filePath);
        m.getNodesContent(result);
    }
}

The code above prints nothing. When I try it with a sample string like "My string", I get the result. The .gexf file is too long to post here, so I uploaded it: https://files.fm/u/qag5ykrx

  • FYI: If you want to read an entire file into a String, you can just do return new String(Files.readAllBytes(Paths.get(filePath))); Commented May 5, 2018 at 22:34
  • Since it's just XML, why not use an XML parser and maybe XPath? Commented May 5, 2018 at 22:35
  • What is your question? Commented May 5, 2018 at 22:36
  • And a quick Google search (since I had no idea what gexf was) shows there are a number of libraries available for it. Maybe consider one of those. Commented May 5, 2018 at 22:37
  • I tried the XPath class, but I got stuck there too. Do you think it is the best way to achieve that? Commented May 5, 2018 at 23:22
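
The one-line file read suggested in the first comment can be sketched as follows. This is only an illustration: the class name ReadWholeFile, the temporary sample file, and the choice of UTF-8 are assumptions, not part of the original question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ReadWholeFile {

    // Reads the whole file into one String, decoding the bytes as UTF-8 (assumed encoding).
    static String readFile(String filePath) throws IOException {
        return new String(Files.readAllBytes(Paths.get(filePath)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file so the sketch runs without the original babel.gexf.
        Path tmp = Files.createTempFile("sample", ".gexf");
        Files.write(tmp, "<nodes>\n  <node id=\"0\"/>\n</nodes>\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(readFile(tmp.toString()));
        Files.delete(tmp);
    }
}
```

Unlike the readLine() loop in the question, this keeps the file's line terminators exactly as they are on disk.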

2 Answers

1

I don't think placing the entire file contents into a single string is such a great idea, but I suppose that depends on how much content the file holds. If it's a lot, I would read it in differently. It would have been nice to see a fictitious example of what the file contains.

I suppose you can try this little method. The heart of it utilizes a regular expression (RegEx) along with Pattern/Matcher to retrieve the desired substring from between tags.

It is important to read the docs that come with the method:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * This method will retrieve a string contained between string tags. You
 * specify what the starting and ending tags are within the startTag and
 * endTag parameters. It is you who determines what the start and end tags
 * are to be which can be any strings.<br><br>
 *
 * @param inputString (String) Any string to process.<br>
 *
 * @param startTag (String) The Start Tag String or String. Data content retrieved
 * will be directly after this tag.<br><br>
 *
 * The supplied Start Tag criteria can contain a single special wildcard tag
 * (~*~) providing you also place something like the closing chevron (>)
 * for an HTML tag after the wildcard tag, for example:<pre>
 *
 * If we have a string which looks like this:
 * {@code
 * "<p style=\"padding-left:40px;\">Hello</p>"
 * }
 * (Note: to pass double quote marks in a string they must be escaped)
 *
 * and we want to use this method to extract the word "Hello" from between the
 * two HTML tags then your Start Tag can be supplied as "&lt;p~*~&gt;" and of course
 * your End Tag can be "&lt;/p&gt;". The "&lt;p~*~&gt;" would be the same as supplying
 * "&lt;p style=\"padding-left:40px;\"&gt;". Anything between the characters &lt;p and
 * the supplied close chevron (&gt;) is taken into consideration. This allows for
 * contents extraction regardless of what HTML attributes are attached to the
 * tag. The use of a wildcard tag (~*~) is also allowed in a supplied End
 * Tag.</pre><br>
 *
 * The wildcard is used as a special tag so that strings that actually
 * contain asterisks (*) can be processed as regular asterisks.<br>
 *
 * @param endTag (String) The End Tag or String. Data content retrieval will
 * end just before this Tag is reached.<br>
 *
 * The supplied End Tag criteria can contain a single special wildcard tag
 * (~*~) providing you also place something like the closing chevron (&gt;)
 * for an HTML tag after the wildcard tag, for example:<pre>
 *
 * If we have a string which looks like this:
 * {@code
 * "<p style=\"padding-left:40px;\">Hello</p>"
 * }
 * (Note: to pass double quote marks in a string they must be escaped)
 *
 * and we want to use this method to extract the word "Hello" from between the
 * two HTML tags then your Start Tag can be supplied as "&lt;p style=\"padding-left:40px;\"&gt;"
 * and your End Tag can be "&lt;/~*~&gt;". The "&lt;/~*~&gt;" would be the same as supplying
 * "&lt;/p&gt;". Anything between the characters &lt;/ and the supplied close chevron (&gt;)
 * is taken into consideration. This allows for contents extraction regardless of what the
 * HTML tag might be. The use of a wildcard tag (~*~) is also allowed in a supplied Start Tag.</pre><br>
 *
 * The wildcard is used as a special tag so that strings that actually
 * contain asterisks (*) can be processed as regular asterisks.<br>
 *
 * @param trimFoundData (Optional - Boolean - Default is true) By default
 * all retrieved data is trimmed of leading and trailing white-spaces. If
 * you do not want this then supply false to this optional parameter.
 *
 * @return (1D String Array) If there is more than one pair of Start and End
 * Tags contained within the supplied input String then each set is placed
 * into the Array separately.<br>
 *
 * @throws IllegalArgumentException if any supplied method String argument
 * is null or empty ("").
 */
public static String[] getBetweenTags(String inputString, String startTag,
        String endTag, boolean... trimFoundData) {
    // Reject null or empty arguments up front.
    if (inputString == null || inputString.equals("")
            || startTag == null || startTag.equals("")
            || endTag == null || endTag.equals("")) {
        throw new IllegalArgumentException("\ngetBetweenTags() Method Error! - "
                + "A supplied method argument contains Null (\"\")!\n"
                + "Supplied Method Arguments:\n"
                + "==========================\n"
                + "inputString = \"" + inputString + "\"\n"
                + "startTag = \"" + startTag + "\"\n"
                + "endTag = \"" + endTag + "\"\n");
    }

    List<String> list = new ArrayList<>();
    boolean trimFound = true;
    if (trimFoundData.length > 0) {
        trimFound = trimFoundData[0];
    }

    Matcher matcher;
    if (startTag.contains("~*~") || endTag.contains("~*~")) {
        // Wildcard form: ~*~ becomes a non-greedy "anything" in the regex.
        startTag = startTag.replace("~*~", ".*?");
        endTag = endTag.replace("~*~", ".*?");
        Pattern pattern = Pattern.compile("(?iu)" + startTag + "(.*?)" + endTag);
        matcher = pattern.matcher(inputString);
    } else {
        // Literal form: quote both tags; (?s) lets the content span line breaks.
        String regexString = Pattern.quote(startTag) + "(?s)(.*?)" + Pattern.quote(endTag);
        Pattern pattern = Pattern.compile("(?iu)" + regexString);
        matcher = pattern.matcher(inputString);
    }

    // Collect the content of every start/end tag pair found.
    while (matcher.find()) {
        String match = matcher.group(1);
        if (trimFound) {
            match = match.trim();
        }
        list.add(match);
    }
    return list.toArray(new String[list.size()]);
}
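
The non-wildcard branch of the method above boils down to a few lines, which may help explain why the pattern in the question finds nothing. In that pattern, \w+ matches only letters, digits, and underscores, so it cannot span the whitespace, line breaks, and nested tags between the nodes tags, and Pattern.MULTILINE only changes how ^ and $ behave. The sketch below is a reduced illustration, not a replacement for the full method; the class name BetweenTagsSketch, the helper between, and the sample XML are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenTagsSketch {

    // Returns the trimmed text found between each startTag/endTag pair.
    static List<String> between(String input, String startTag, String endTag) {
        // Pattern.quote treats the tags as literal text; DOTALL lets "." match
        // line breaks, and the non-greedy (.*?) stops at the first end tag.
        Pattern pattern = Pattern.compile(
                Pattern.quote(startTag) + "(.*?)" + Pattern.quote(endTag),
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample; the real file's contents are not shown in the question.
        String xml = "<nodes>\n  <node id=\"0\" label=\"a\"/>\n</nodes>";
        System.out.println(between(xml, "<nodes>", "</nodes>"));
    }
}
```

A regex like this only copes with literal, non-nested tags. If the opening tag carries attributes, the literal start tag will not match, which is what the ~*~ wildcard in the full method is for.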

1 Comment

Thanks for your kind response! This will probably work. I also solved my problem by using XPath, which seems like a better way to do it. But I will try this solution too and see what happens!
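
For reference, an XPath-based approach along the lines mentioned in this comment might look like the sketch below. The actual solution was not posted, so everything here is an assumption: the class name XPathNodesSketch, the sample XML, the "label" attribute, and the expression //nodes/node. It uses only the JDK's built-in javax.xml APIs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathNodesSketch {

    // Returns the "label" attribute of every <node> element directly under <nodes>.
    static List<String> nodeLabels(String xml) throws Exception {
        // Parse the XML string into a DOM document.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // Select the matching elements with an XPath expression.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//nodes/node", doc, XPathConstants.NODESET);
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            labels.add(((Element) nodes.item(i)).getAttribute("label"));
        }
        return labels;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, namespace-free sample standing in for the real file.
        String xml = "<gexf><graph><nodes>"
                + "<node id=\"0\" label=\"Hello\"/>"
                + "<node id=\"1\" label=\"World\"/>"
                + "</nodes></graph></gexf>";
        System.out.println(nodeLabels(xml)); // prints [Hello, World]
    }
}
```

The sample deliberately has no XML namespace declaration. A real GEXF file may declare a default namespace, which this sketch does not handle.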
0

Without a sample of the file I can only suggest so much. What I can tell you, however, is that you can get the substring of that text using a tag search loop. Here is an example:

String s = "<a>test</a><b>list</b><a>class</a>";
int start = 0, end = 0;
// Scan for each opening <a> tag.
for (int i = 0; i < s.length(); i++) {
    if (s.startsWith("<a>", i)) {
        start = i + 3; // first character after "<a>"
        // Scan forward from there for the closing </a> tag.
        for (int j = start; j < s.length(); j++) {
            if (s.startsWith("</a>", j)) {
                end = j;
                System.out.println(s.substring(start, end));
                break;
            }
        }
    }
}

The above code searches string s for the <a> tag, starts just after the position where it found it, and continues until it finds the closing </a> tag. It then uses those two positions to create a substring of s, which is the text between the two tags. You can stack as many of these tag searches as you want. Here is an example of a 2 tag search:

String s = "<a>test</a><b>list</b><a>class</a>";
int start = 0, end = 0;
// Scan for each opening <a> or <b> tag.
for (int i = 0; i < s.length(); i++) {
    if (s.startsWith("<a>", i) || s.startsWith("<b>", i)) {
        start = i + 3; // first character after the opening tag
        // Scan forward from there for a closing </a> or </b> tag.
        for (int j = start; j < s.length(); j++) {
            if (s.startsWith("</a>", j) || s.startsWith("</b>", j)) {
                end = j;
                System.out.println(s.substring(start, end));
                break;
            }
        }
    }
}

The only difference is that I've added clauses to the if statements to also get the text between b tags. This approach is extremely versatile and I think you'll find plenty of uses for it.
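
Rather than adding one clause per tag, the same search can be wrapped in a helper that takes the tag name as a parameter. The following is a minimal sketch of that idea using String.indexOf; the class name TagSearchSketch and the method textBetween are invented for illustration, and it assumes plain tags without attributes or nesting.

```java
import java.util.ArrayList;
import java.util.List;

class TagSearchSketch {

    // Collects the text between every <tag>...</tag> pair in s.
    static List<String> textBetween(String s, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> found = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = s.indexOf(open, from);
            if (start < 0) {
                break; // no further opening tag
            }
            start += open.length(); // first character after the opening tag
            int end = s.indexOf(close, start);
            if (end < 0) {
                break; // opening tag without a closing tag
            }
            found.add(s.substring(start, end));
            from = end + close.length(); // continue after the closing tag
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "<a>test</a><b>list</b><a>class</a>";
        System.out.println(textBetween(s, "a")); // prints [test, class]
        System.out.println(textBetween(s, "b")); // prints [list]
    }
}
```

Calling the helper once per tag name gives the same results as the stacked if clauses, with the tag length no longer hard-coded as 3.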
