Java Extracting Text Between Tags and Attributes

Question

I am trying to extract text between particular tags and attributes. So far I have only tried extracting text between tags. I am reading a ".gexf" file, which contains XML data, and saving its contents as a string. Then I try to extract the text between the "nodes" tags. Here is my code so far:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    private static String filePath = "src/babel.gexf";

    public String readFile(String filePath) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(filePath));
        try {
            StringBuilder sb = new StringBuilder();
            String line = br.readLine();
            while (line != null) {
                sb.append(line);
                sb.append("\n");
                line = br.readLine();
            }
            return sb.toString();
        } finally {
            br.close();
        }
    }

    public void getNodesContent(String content) throws IOException {
        final Pattern pattern = Pattern.compile("<nodes>(\\w+)</nodes>", Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }

    public static void main(String [] args) throws IOException {
        Main m = new Main();
        String result = m.readFile(filePath);
        m.getNodesContent(result);
    }
}

With the code above, I don't get any result. When I try it with a sample string like "My string", I do get a result. Link to the gexf file (it is too long to paste here, so I uploaded it): https://files.fm/u/qag5ykrx

FYI: If you want to read an entire file into a String, you should just do return new String(Files.readAllBytes(Paths.get(filePath)));
– Andreas, Commented May 5, 2018 at 22:34
Since it's just XML, why not use an XML parser and maybe XPath?
– MadProgrammer, Commented May 5, 2018 at 22:35
And a quick Google search (since I had no idea what gexf was) shows there are a number of libraries available for it. Maybe consider one of those.
– MadProgrammer, Commented May 5, 2018 at 22:37
I tried the XPath class but I also got stuck there. Do you think it is the best way to achieve that?
– rawsly, Commented May 5, 2018 at 23:22
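
The XML parser plus XPath route suggested in the comments could look roughly like the sketch below. It is not the asker's eventual solution, which was never posted. The class name XPathSketch, the helper nodeLabels, and the sample document are made up for illustration, and the sketch assumes each node element carries a label attribute; the real babel.gexf may be laid out differently. Because GEXF files usually declare a default XML namespace, the expression matches on local-name() so that it works with or without one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    // Hypothetical helper: returns the label attribute of every <node> inside <nodes>.
    static List<String> nodeLabels(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so local-name() behaves predictably
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Match on local names so a default xmlns declaration does not get in the way.
            String expr = "//*[local-name()='nodes']/*[local-name()='node']/@label";
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);

            List<String> labels = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                labels.add(hits.item(i).getNodeValue());
            }
            return labels;
        } catch (Exception e) {
            // Parsing and XPath errors are wrapped to keep the sketch short.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample; the structure of the real file is an assumption.
        String xml = "<gexf xmlns=\"http://www.gexf.net/1.2draft\"><graph><nodes>\n"
                + "  <node id=\"0\" label=\"Hello\"/>\n"
                + "  <node id=\"1\" label=\"World\"/>\n"
                + "</nodes></graph></gexf>";
        System.out.println(nodeLabels(xml)); // prints [Hello, World]
    }
}
```

To read from the file instead of a string, the same DocumentBuilder can parse a java.io.File directly, which also avoids loading the whole document into a String first.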

DevilsHnd - 退した · Accepted Answer · 2018-05-06 01:03:44Z

I don't think placing the entire file contents into a single string is such a great idea, but I suppose that depends on the amount of content within the file. If it's a lot of content then I would read it in a little differently. It would have been nice to see a fictitious example of what the file contains.

I suppose you can try this little method. The heart of it utilizes a regular expression (RegEx) along with Pattern/Matcher to retrieve the desired substring from between tags.

It is important to read the docs that come with the method:

// Imports needed by the method below.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * This method will retrieve a string contained between string tags. You
 * specify what the starting and ending tags are within the startTag and
 * endTag parameters. It is you who determines what the start and end tags
 * are to be which can be any strings.<br><br>
 *
 * @param inputString (String) Any string to process.<br>
 *
 * @param startTag (String) The Start Tag or String. Data content retrieved
 * will be directly after this tag.<br><br>
 *
 * The supplied Start Tag criteria can contain a single special wildcard tag
 * (~*~) providing you also place something like the closing chevron (&gt;)
 * for an HTML tag after the wildcard tag, for example:<pre>
 *
 * If we have a string which looks like this:
 *      {@code
 *      "<p style=\"padding-left:40px;\">Hello</p>"
 *      }
 *      (Note: to pass double quote marks in a string they must be escaped)
 *
 * and we want to use this method to extract the word "Hello" from between the
 * two HTML tags then your Start Tag can be supplied as "&lt;p~*~&gt;" and of course
 * your End Tag can be "&lt;/p&gt;". The "&lt;p~*~&gt;" would be the same as supplying
 * "&lt;p style=\"padding-left:40px;\"&gt;". Anything between the characters &lt;p and
 * the supplied close chevron (&gt;) is taken into consideration. This allows for
 * contents extraction regardless of what HTML attributes are attached to the
 * tag. The use of a wildcard tag (~*~) is also allowed in a supplied End
 * Tag.</pre><br>
 *
 * The wildcard is used as a special tag so that strings that actually
 * contain asterisks (*) can be processed as regular asterisks.<br>
 *
 * @param endTag (String) The End Tag or String. Data content retrieval will
 * end just before this Tag is reached.<br>
 *
 * The supplied End Tag criteria can contain a single special wildcard tag
 * (~*~) providing you also place something like the closing chevron (&gt;)
 * for an HTML tag after the wildcard tag, for example:<pre>
 *
 * If we have a string which looks like this:
 *      {@code
 *      "<p style=\"padding-left:40px;\">Hello</p>"
 *      }
 *      (Note: to pass double quote marks in a string they must be escaped)
 *
 * and we want to use this method to extract the word "Hello" from between the
 * two HTML tags then your Start Tag can be supplied as "&lt;p style=\"padding-left:40px;\"&gt;"
 * and your End Tag can be "&lt;/~*~&gt;". The "&lt;/~*~&gt;" would be the same as supplying
 * "&lt;/p&gt;". Anything between the characters &lt;/ and the supplied close chevron (&gt;)
 * is taken into consideration. This allows for contents extraction regardless of what the
 * HTML tag might be. The use of a wildcard tag (~*~) is also allowed in a supplied Start Tag.</pre><br>
 *
 * The wildcard is used as a special tag so that strings that actually
 * contain asterisks (*) can be processed as regular asterisks.<br>
 *
 * @param trimFoundData (Optional - Boolean - Default is true) By default
 * all retrieved data is trimmed of leading and trailing white-spaces. If
 * you do not want this then supply false to this optional parameter.
 *
 * @return (1D String Array) If there is more than one pair of Start and End
 * Tags contained within the supplied input String then each set is placed
 * into the Array separately.<br>
 *
 * @throws IllegalArgumentException if any supplied method String argument
 * is null or empty ("").
 */
public static String[] getBetweenTags(String inputString, String startTag,
        String endTag, boolean... trimFoundData) {
    if (inputString == null || inputString.equals("") || startTag == null
            || startTag.equals("") || endTag == null || endTag.equals("")) {
        throw new IllegalArgumentException("\ngetBetweenTags() Method Error! - "
                + "A supplied method argument is null or empty (\"\")!\n"
                + "Supplied Method Arguments:\n"
                + "==========================\n"
                + "inputString = \"" + inputString + "\"\n"
                + "startTag = \"" + startTag + "\"\n"
                + "endTag = \"" + endTag + "\"\n");
    }

    List<String> list = new ArrayList<>();
    boolean trimFound = true;
    if (trimFoundData.length > 0) {
        trimFound = trimFoundData[0];
    }

    Matcher matcher;
    if (startTag.contains("~*~") || endTag.contains("~*~")) {
        // Wildcard mode: the tags are used as regex fragments (not quoted),
        // with each ~*~ turned into a lazy "match anything".
        startTag = startTag.replace("~*~", ".*?");
        endTag = endTag.replace("~*~", ".*?");
        // (?i) ignore case, (?u) Unicode-aware case folding, (?s) lets '.'
        // match line breaks so content spanning several lines is found,
        // as it already is in the non-wildcard branch below.
        Pattern pattern = Pattern.compile("(?ius)" + startTag + "(.*?)" + endTag);
        matcher = pattern.matcher(inputString);
    } else {
        // Literal mode: both tags are quoted so regex metacharacters in them
        // are treated as plain text.
        String regexString = Pattern.quote(startTag) + "(?s)(.*?)" + Pattern.quote(endTag);
        Pattern pattern = Pattern.compile("(?iu)" + regexString);
        matcher = pattern.matcher(inputString);
    }

    // Collect the captured text of every start/end tag pair.
    while (matcher.find()) {
        String match = matcher.group(1);
        if (trimFound) {
            match = match.trim();
        }
        list.add(match);
    }
    return list.toArray(new String[list.size()]);
}
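
For comparison with the pattern in the question, here is a minimal sketch of the same regular-expression idea reduced to its core. The question's pattern, <nodes>(\\w+)</nodes>, only allows word characters between the tags, so it cannot match content containing whitespace, line breaks, or nested elements; Pattern.MULTILINE only changes how ^ and $ behave and does not help here. The class name, helper name, and sample string below are made up, and the sketch assumes the opening nodes tag carries no attributes, which may not hold for the real file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NodesRegexSketch {

    // Hypothetical helper: returns the trimmed text between each <nodes>...</nodes> pair.
    static List<String> nodesContent(String content) {
        // DOTALL lets '.' match line breaks; the lazy .*? stops at the first closing tag.
        Pattern pattern = Pattern.compile("<nodes>(.*?)</nodes>", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(content);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample with the content spread over several lines.
        String content = "<graph>\n<nodes>\n  <node id=\"0\" label=\"Hello\"/>\n</nodes>\n</graph>";
        for (String block : nodesContent(content)) {
            System.out.println(block); // prints <node id="0" label="Hello"/>
        }
    }
}
```

A regular expression like this is fine for quick extraction from well-behaved input, but it does not understand XML nesting, comments, or attributes on the tag itself.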

Thanks for your kind response! This will probably work. I also solved my problem by using XPath, which seems a better way to do it. But I will also try this solution and see what happens!

Kwright02 · Accepted Answer · 2018-05-05 22:58:31Z

Without a sample of the file I can only suggest so much. What I can tell you, however, is that you can get the substring of that text using a tag search loop. Here is an example:

String s = "<a>test</a><b>list</b><a>class</a>";
int start = 0, end = 0;
for (int i = 0; i < s.length(); i++) {
    // startsWith(prefix, offset) never reads past the end of the string
    if (s.startsWith("<a>", i)) {
        start = i + 3; // first character after "<a>"
        // Search from start (not start + 3) so contents shorter than
        // three characters are not skipped
        for (int j = start; j < s.length(); j++) {
            if (s.startsWith("</a>", j)) {
                end = j;
                System.out.println(s.substring(start, end));
                break;
            }
        }
    }
}

The above code will search string s for the opening a tag, start where it found that, and continue until it finds the closing a tag. Then it uses those two positions to create a substring of the string, which is the text between the two tags. You can stack as many of these tag searches as you want. Here is an example of a 2 tag search:

String s = "<a>test</a><b>list</b><a>class</a>";
int start = 0, end = 0;
for (int i = 0; i < s.length(); i++) {
    // Match either opening tag at position i
    if (s.startsWith("<a>", i) || s.startsWith("<b>", i)) {
        start = i + 3; // first character after the opening tag
        for (int j = start; j < s.length(); j++) {
            // Stop at the first closing a or b tag
            if (s.startsWith("</a>", j) || s.startsWith("</b>", j)) {
                end = j;
                System.out.println(s.substring(start, end));
                break;
            }
        }
    }
}

The only difference is that I've added clauses to the if statements to also get the text between b tags. This system is extremely versatile and I think you'll find an abundance of uses for it.
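
Rather than adding a clause per tag, the same search can be wrapped in a helper that takes the tag name as a parameter. The sketch below is one possible way to do that with String.indexOf; the class name TagScan and the method textBetween are made up for illustration. Like the loops above, it assumes simple tags without attributes and no nesting of the same tag.

```java
import java.util.ArrayList;
import java.util.List;

class TagScan {

    // Hypothetical helper: collects the text between every <tag>...</tag> pair.
    static List<String> textBetween(String s, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> found = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = s.indexOf(open, from);
            if (start < 0) {
                break; // no further opening tag
            }
            start += open.length(); // first character after the opening tag
            int end = s.indexOf(close, start);
            if (end < 0) {
                break; // opening tag without a matching closing tag
            }
            found.add(s.substring(start, end));
            from = end + close.length(); // continue after the closing tag
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "<a>test</a><b>list</b><a>class</a>";
        System.out.println(textBetween(s, "a")); // prints [test, class]
        System.out.println(textBetween(s, "b")); // prints [list]
    }
}
```

Using indexOf keeps all index arithmetic inside the standard library, so there is no risk of reading past the end of the string.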
