Context
Version 1.20.1 introduced the following change:
"the HTML parser no longer allows self-closing tags (<foo />) to close HTML elements by default. [...] If you need specific HTML tags to support self-closing, you can register a custom tag via the TagSet configured in Parser.tagSet(), [...]"
With this change, any custom HTML tag that is expected to be void or self-closing must be declared in the parser's TagSet.
Description
This change seems to have introduced two bugs for custom tags that need to be void and/or self-closing:
- Bug 1: a tag declared as void is not parsed as such. Any content after the tag is incorrectly included in the tag body.
  There is no workaround for this bug.
- Bug 2: a tag declared as both void and self-closing is incorrectly printed as a void tag, which makes parsing/printing non re-entrant because of bug 1.
  There is one workaround: printing as XML. However, this is not desirable and should not be required to get consistent parsing/printing behavior.
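The re-entrancy expectation behind bug 2 can be stated as a small check: printing a parsed document, parsing that output again, and printing it once more should give the same text. The sketch below is plain Java and does not use jsoup; the class `RoundTripCheck` and the method `isReentrant` are hypothetical names introduced only for illustration.

```java
import java.util.function.UnaryOperator;

/** Hypothetical helper: checks that a parse-and-print step is stable when applied twice. */
final class RoundTripCheck {
    private RoundTripCheck() {}

    /**
     * Returns true when feeding the printed output back through the same
     * parse-and-print step yields identical output.
     */
    static boolean isReentrant(UnaryOperator<String> parseAndPrint, String html) {
        String firstPass = parseAndPrint.apply(html);
        String secondPass = parseAndPrint.apply(firstPass);
        return firstPass.equals(secondPass);
    }
}
```

With jsoup, the `parseAndPrint` argument could be a lambda that parses the input with the configured parser and returns `body().html()`, as the unit tests in this report do.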
Unit test for bug 1, parsing/printing a tag declared as void:

```java
// Imports used by the tests in this report (JUnit 4 assumed)
import static org.junit.Assert.assertEquals;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Document.OutputSettings.Syntax;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;
import org.jsoup.parser.TagSet;
import org.junit.Test;

private static final String EXAMPLE_URI = "https://www.example.com/";

@Test
public void testVoidTagParsing() {
    final TagSet tagset = TagSet.Html().add(new Tag("voidtag").set(Tag.Void));
    final Parser parser = Parser.htmlParser().tagSet(tagset);
    final String html = "<p><voidtag>Hello World</p>";
    final Document doc = parser.parseInput(html, EXAMPLE_URI);

    // Bug with 1.21.2: HTML was parsed as "<p><voidtag>Hello World</voidtag></p>"
    doc.outputSettings().syntax(Syntax.html);
    assertEquals("a void tag must NOT include the following content as its body",
            "<p><voidtag>Hello World</p>", doc.body().html()); // fails with 1.21.2

    doc.outputSettings().syntax(Syntax.xml);
    assertEquals("a void tag must NOT include the following content as its body",
            "<p><voidtag />Hello World</p>", doc.body().html()); // fails with 1.21.2
}
```

Unit test for bug 2, parsing/printing a tag declared as both self-closing AND void:

```java
@Test
public void testSelfClosingVoidTagParsing() {
    final TagSet tagset = TagSet.Html().add(new Tag("selfclosingvoidtag").set(Tag.Void).set(Tag.SelfClose));
    final Parser parser = Parser.htmlParser().tagSet(tagset);
    final String html = "<p><selfclosingvoidtag />Hello World</p>";
    final Document doc = parser.parseInput(html, EXAMPLE_URI);

    // Bug with 1.21.2: printed as "<p><selfclosingvoidtag>Hello World</p>"
    // This is a bug because it is not re-entrant: if parsed again, it is
    // incorrectly parsed as "<p><selfclosingvoidtag>Hello World</selfclosingvoidtag></p>"
    doc.outputSettings().syntax(Syntax.html);
    assertEquals("Self-closing tag must be printed as self-closing",
            html, doc.body().html()); // BUG "<p><selfclosingvoidtag>Hello World</p>"

    // Compare printed output with the printed output of its re-parse (String vs String)
    assertEquals("Parsing/Printing must be re-entrant",
            doc.body().html(),
            parser.parseInput(doc.body().html(), EXAMPLE_URI).body().html()); // BUG Not re-entrant

    doc.outputSettings().syntax(Syntax.xml);
    assertEquals(html, doc.body().html()); // "<p><selfclosingvoidtag />Hello World</p>"
}
```

And for completeness, a unit test demonstrating the parsing/printing behavior for a self-closing tag (which is correct):

```java
// This test passes with 1.21.2
@Test
public void testSelfClosingTagParsing() {
    final TagSet tagset = TagSet.Html().add(new Tag("selfclosingtag").set(Tag.SelfClose));
    final Parser parser = Parser.htmlParser().tagSet(tagset);
    final String html = "<p><selfclosingtag></selfclosingtag>Hello World</p>";
    final Document doc = parser.parseInput(html, EXAMPLE_URI);

    doc.outputSettings().syntax(Syntax.html);
    assertEquals(html, doc.body().html());

    doc.outputSettings().syntax(Syntax.xml);
    assertEquals("<p><selfclosingtag />Hello World</p>", doc.body().html());
}
```

Not being familiar enough with the internal jsoup parsing/printing logic, I'm not sure where to look to provide a fix.
Use case
In the CMS I develop, a custom HTML syntax was created (years ago) to allow consistent dynamic rendering:
- Contributors may insert rich content through a WYSIWYG editor, such as links to content, media, user mentions, tables of contents, etc. This rich content is saved in the CMS using custom HTML tags: `<cms:link>...</cms:link>`, `<cms:toc/>`
- This custom HTML syntax is NEVER sent directly to the browser. Thanks to jsoup, the CMS performs the following processing (server side):
  - parsing and cleaning the custom HTML when contributors save it (to prevent injection);
  - rendering the custom HTML tags when a user reads or displays a piece of content, using the latest templates available in the CMS, so that all content always benefits from the latest UI/rendering.
These tags have different properties depending on their use:
- void and self-closing: table of contents, mention
- inline, and may or may not contain inline content: link
- block, and contains inline content: message, abstract
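As an illustration of the three groups above, here is a minimal plain-Java sketch of how such a classification could be recorded. It does not use jsoup, and `CmsTagProperties`, `ContentModel`, and `mayHaveBody` are hypothetical names, not part of the CMS or of any library; only the tag names come from this report.

```java
import java.util.Map;

/** Hypothetical registry of the custom tags and the group each belongs to. */
final class CmsTagProperties {
    private CmsTagProperties() {}

    /** The three groups listed above. */
    enum ContentModel {
        VOID_SELF_CLOSING(false), // never has a body
        INLINE(true),             // may or may not contain inline content
        BLOCK(true);              // contains inline content

        private final boolean mayHaveBody;

        ContentModel(boolean mayHaveBody) {
            this.mayHaveBody = mayHaveBody;
        }

        boolean mayHaveBody() {
            return mayHaveBody;
        }
    }

    static final Map<String, ContentModel> TAGS = Map.of(
            "cms:toc", ContentModel.VOID_SELF_CLOSING,
            "cms:mention", ContentModel.VOID_SELF_CLOSING,
            "cms:link", ContentModel.INLINE,
            "cms:msg", ContentModel.BLOCK);

    /** Tells whether the given custom tag is allowed to have a body. */
    static boolean mayHaveBody(String tagName) {
        ContentModel model = TAGS.get(tagName);
        if (model == null) {
            throw new IllegalArgumentException("Unknown custom tag: " + tagName);
        }
        return model.mayHaveBody();
    }
}
```

In this sketch, only the first group would need both the `Tag.Void` and `Tag.SelfClose` options when registered with jsoup, which is the combination affected by bug 2.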
| custom HTML (stored server side) | rendering example |
|---|---|
| `<cms:mention data-user-id="{some-id-of-a-mentioned-user}" />` | `<a href="...">@John Doe</a>` |
| `<cms:link data-content-id="{some-content-id}"/>` | `<a href="article-xyz">article XYZ</a>` |
| `<cms:link data-content-id="{some-content-id}">a custom title</cms:link>` | `<a href="article-xyz">a custom title</a>` |
| `<cms:toc/>` | a rich table of contents generated from the h1/h2... headings of the HTML |
| `<cms:msg data-level="warn">work in progress</cms:msg>` | `<div class="well bg-danger">work in progress</div>` |
| ... | ... |