Incorrect handling of void tags and selfclosing tags #2425

@OlivierJaquemet

Context

Version 1.20.1 has introduced the following modification:

the HTML parser no longer allows self-closing tags (<foo />) to close HTML elements by default.[...] If you need specific HTML tags to support self-closing, you can register a custom tag via the TagSet configured in Parser.tagSet(), [...]

With this modification, any custom HTML tag that is expected to be void and/or self-closing must be declared in the TagSet passed to Parser.tagSet().

Description

It seems this modification has introduced 2 bugs when using custom tags for which void and/or selfclosing is desired:

  1. a tag declared as void is not parsed as such. Any content after the tag is incorrectly included in the tag body.
    There is no workaround for this bug.
  2. a tag declared as both void and selfclosing is incorrectly printed as a void tag, leading to non re-entrant behavior due to bug 1.
    There is one workaround: printing as xml. However, it is not desired and should not be required to ensure consistent parsing/printing behavior.

Unit test for bug 1, when parsing/printing a tag declared as void:

    private static final String EXAMPLE_URI = "https://www.example.com/";

    @Test
    public void testVoidTagParsing() {
      final TagSet tagset = TagSet.Html().add(new Tag("voidtag").set(Tag.Void));
      final Parser parser = Parser.htmlParser().tagSet(tagset);

      final String html = "<p><voidtag>Hello World</p>";
      final Document doc = parser.parseInput(html, EXAMPLE_URI);
      // Bug with 1.21.2 : HTML was parsed as "<p><voidtag>Hello World</voidtag></p>"

      doc.outputSettings().syntax(Syntax.html);
      assertEquals("void tag must NOT include following content as their body",
          "<p><voidtag>Hello World</p>", doc.body().html()); // fails with 1.21.2

      doc.outputSettings().syntax(Syntax.xml);
      assertEquals("void tag must NOT include following content as their body",
          "<p><voidtag />Hello World</p>", doc.body().html()); // fails with 1.21.2
    }

Unit test for bug 2, when parsing/printing a tag declared as both selfclosing AND void:

    @Test
    public void testSelfClosingVoidTagParsing() {
      final TagSet tagset = TagSet.Html().add(new Tag("selfclosingvoidtag").set(Tag.Void).set(Tag.SelfClose));
      final Parser parser = Parser.htmlParser().tagSet(tagset);

      final String html = "<p><selfclosingvoidtag />Hello World</p>";
      final Document doc = parser.parseInput(html, EXAMPLE_URI);
      // Bug with 1.21.2, printed as "<p><selfclosingvoidtag>Hello World</p>"
      // This is a bug because it is not reentrant : if parsed again,
      // it is incorrectly parsed as "<p><selfclosingvoidtag>Hello World</selfclosingvoidtag></p>"

      doc.outputSettings().syntax(Syntax.html);
      assertEquals("Self closing tag must be printed as self closing, not ",
          html, doc.body().html()); // BUG "<p><selfclosingvoidtag>Hello World</p>"
      // Compare printed output with printed output (String with String, not String with Document)
      assertEquals("Parsing/Printing must be reentrant",
          doc.body().html(),
          parser.parseInput(doc.body().html(), EXAMPLE_URI).body().html()); // BUG Not re-entrant

      doc.outputSettings().syntax(Syntax.xml);
      assertEquals(html, doc.body().html()); // "<p><selfclosingvoidtag />Hello World</p>"
    }
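
The re-entrancy requirement in the test above can be stated on its own: printing, re-parsing and printing again should give back the same string. Below is a minimal sketch of that check in plain Java. It does not call jsoup. The `reportedRoundTrip` function is a hypothetical stand-in, a lookup table that simply replays the outputs reported in this issue for 1.21.2, and `isReentrant` is a helper name introduced here for illustration.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

public class ReentrancyCheck {

    /** True when a second parse/print pass leaves the output of the first pass unchanged. */
    static boolean isReentrant(UnaryOperator<String> parseAndPrint, String html) {
        String once = parseAndPrint.apply(html);
        String twice = parseAndPrint.apply(once);
        return once.equals(twice);
    }

    // ASSUMPTION: stand-in for "parse with jsoup, then print as HTML".
    // It is a lookup table replaying the outputs reported in this issue, not a parser.
    static final Map<String, String> REPORTED = Map.of(
        "<p><selfclosingvoidtag />Hello World</p>",
        "<p><selfclosingvoidtag>Hello World</p>",
        "<p><selfclosingvoidtag>Hello World</p>",
        "<p><selfclosingvoidtag>Hello World</selfclosingvoidtag></p>");

    /** Returns the reported output for a known input, or the input unchanged otherwise. */
    static String reportedRoundTrip(String html) {
        return REPORTED.getOrDefault(html, html);
    }

    public static void main(String[] args) {
        String html = "<p><selfclosingvoidtag />Hello World</p>";
        // The reported behavior changes again on the second pass, so it is not re-entrant.
        System.out.println(isReentrant(ReentrancyCheck::reportedRoundTrip, html));
        // A round trip that changes nothing is trivially re-entrant.
        System.out.println(isReentrant(UnaryOperator.identity(), html));
    }
}
```

In a real test, the stand-in would be replaced by a function that parses with the configured Parser and returns doc.body().html().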

And for completeness, a unit test to demonstrate the parsing/printing behavior for a selfclosing tag (which is correct):

    // This test passes with 1.21.2
    @Test
    public void testSelfClosingTagParsing() {
      final TagSet tagset = TagSet.Html().add(new Tag("selfclosingtag").set(Tag.SelfClose));
      final Parser parser = Parser.htmlParser().tagSet(tagset);

      final String html = "<p><selfclosingtag></selfclosingtag>Hello World</p>";
      final Document doc = parser.parseInput(html, EXAMPLE_URI);

      doc.outputSettings().syntax(Syntax.html);
      assertEquals(html, doc.body().html());

      doc.outputSettings().syntax(Syntax.xml);
      assertEquals("<p><selfclosingtag />Hello World</p>", doc.body().html());
    }
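
Taken together, the three tests assert one expected printed form per combination of tag options and output syntax. The sketch below restates those expectations as a small plain-Java function, to make the matrix easier to read. It is a model of what the tests assert, not jsoup's implementation: the `Option` and `Syntax` enums and the `print` helper are hypothetical names introduced here, and only the combinations covered by the tests are modeled.

```java
import java.util.EnumSet;
import java.util.Set;

public class ExpectedEmptyTagOutput {

    // ASSUMPTION: stand-ins for the Tag.Void / Tag.SelfClose options used in the tests.
    enum Option { VOID, SELF_CLOSE }

    // ASSUMPTION: stand-in for the html / xml output syntax used in the tests.
    enum Syntax { HTML, XML }

    /** Printed form the tests expect for a custom tag that has no content of its own. */
    static String print(String name, Set<Option> options, Syntax syntax) {
        boolean isVoid = options.contains(Option.VOID);
        boolean selfClose = options.contains(Option.SELF_CLOSE);
        if (!isVoid && !selfClose) {
            // Tags with neither option are outside what the three tests cover.
            throw new IllegalArgumentException("not covered by the tests: " + name);
        }
        if (syntax == Syntax.XML) {
            return "<" + name + " />";              // all three tests expect this in xml
        }
        if (isVoid && selfClose) {
            return "<" + name + " />";              // expectation that bug 2 violates
        }
        if (isVoid) {
            return "<" + name + ">";                // void only: no end tag in html
        }
        return "<" + name + "></" + name + ">";     // selfclosing only: explicit end tag in html
    }

    public static void main(String[] args) {
        for (Syntax syntax : Syntax.values()) {
            System.out.println(syntax
                + " " + print("voidtag", EnumSet.of(Option.VOID), syntax)
                + " " + print("selfclosingtag", EnumSet.of(Option.SELF_CLOSE), syntax)
                + " " + print("selfclosingvoidtag", EnumSet.allOf(Option.class), syntax));
        }
    }
}
```

Per the report, 1.21.2 departs from this model in the void-only rows (bug 1, at parse time) and in the html row for void plus selfclosing (bug 2, at print time).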

Not being familiar enough with the internal Jsoup parsing/printing logic, I'm not sure where to look to provide a fix.

Use case

In the CMS I develop, a custom HTML syntax was created (years ago) to allow consistent dynamic rendering:

  • contributors may insert rich content through a wysiwyg editor, such as links to content, media, user mentions, tables of contents, etc. This rich content is saved in the CMS using custom HTML tags <cms:link>...</cms:link>, <cms:toc/>
  • This custom HTML syntax is NEVER sent directly to the browser. Thanks to Jsoup, the CMS performs the following processing (server side):
    • parsing and cleaning the custom HTML when contributors save it (to prevent injection).
    • rendering the custom HTML tags, when a user accesses/reads/displays a content, with the latest templates available in the CMS, ensuring all content always benefits from the latest UI/rendering.

Those tags have different properties depending on their use:

  • void and selfclosing: table of contents, mention
  • inline, with or without inline content: link
  • block, containing inline content: message, abstract

| custom HTML (stored server side) | rendering example |
| --- | --- |
| <cms:mention data-user-id="{some-id-of-a-mentioned-user}" /> | <a href="...">@John Doe</a> |
| <cms:link data-content-id="{some-content-id}"/> | <a href="article-xyz">article XYZ</a> |
| <cms:link data-content-id="{some-content-id}">a custom title</cms:link> | <a href="article-xyz">a custom title</a> |
| <cms:toc/> | a rich table of contents generated from h1/h2... of the HTML |
| <cms:msg data-level="warn">work in progress</cms:msg> | <div class="well bg-danger">work in progress</div> |
| ... | ... |

Labels: bug, fixed
