Wrap Datanodes in CDATA when producing XML #1720

Merged
merged 3 commits into from
Oct 24, 2023

Conversation

maxfortun
Contributor

When parsing html with ampersands in <script> sections and outputting xhtml, the ampersands remain unescaped and break parsing of the resulting document. Setting XML syntax, as suggested in [#202], does not result in <script> content being wrapped in CDATA.

Input html:

<!DOCTYPE html>
<html lang="en">
    <head>
        <title>test</title>
        <script>
            var html = "This is '&nbsp;'";
        </script>
    </head>
    <body>
        <p>
            var html = "This is '&nbsp;'";
        </p>
    </body>
</html>

Java code:

Document document = Jsoup.parse(html);
document.outputSettings().charset("UTF-8");
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
System.out.println(document.toString());

Output xhtml:

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>test</title>
  <script>
                        var html = "This is '&nbsp;'";
                </script>
 </head>
 <body>
  <p> var html = "This is '&nbsp;'"; </p>
 </body>
</html>

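If we also set Entities.EscapeMode.xhtml, the output is a bit more meaningful: the entity in the <p> text is emitted as a numeric character reference, which XML parsers accept. The <script> content, however, is still left untouched.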

Java code:

Document document = Jsoup.parse(html);
document.outputSettings().charset("UTF-8");
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
document.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
System.out.println(document.toString());

Output xhtml:

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>test</title>
  <script>
                        var html = "This is '&nbsp;'";
                </script>
 </head>
 <body>
  <p> var html = "This is '&#xa0;'"; </p>
 </body>
</html>

This PR wraps data node content (<script> and <style>) in CDATA to produce the following xhtml:

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>test</title>
  <script><![CDATA[
                        var html = "This is '&nbsp;'";
                ]]></script>
 </head>
 <body>
  <p> var html = "This is '&#xa0;'"; </p>
 </body>
</html>
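
To illustrate why the wrapping matters, here is a minimal sketch using only the JDK's built-in XML parser (no jsoup involved). The class name `CdataWellFormedCheck` and the helper `isWellFormed` are hypothetical names made up for this example. It checks that a `<script>` element containing a raw `&nbsp;` is rejected as XML, while the CDATA-wrapped form and the `&#xa0;` form both parse.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class CdataWellFormedCheck {

    // Hypothetical helper: true if the string parses as well-formed XML.
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without logging to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors (such as an undeclared entity) land here.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected in this sketch.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "<script>var html = \"This is '&nbsp;'\";</script>";
        String wrapped = "<script><![CDATA[var html = \"This is '&nbsp;'\";]]></script>";
        String numeric = "<p>var html = \"This is '&#xa0;'\";</p>";

        System.out.println(isWellFormed(raw));     // false: nbsp is not declared in XML
        System.out.println(isWellFormed(wrapped)); // true: CDATA content is not parsed for entities
        System.out.println(isWellFormed(numeric)); // true: numeric character references are always valid
    }
}
```

The raw form fails because XML predefines only `amp`, `lt`, `gt`, `apos`, and `quot`; any other named entity needs a DTD declaration.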

@jhy
Owner

jhy commented Oct 24, 2023

Thanks! I modified the change to pivot on XML syntax vs EscapeMode. The EscapeMode is more for how entities should be escaped. It is conceivable to output HTML with EscapeMode = xhtml, in which case these should not go out as CDATA.

Also added test cases.
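
A rough sketch of the decision described above, for illustration only. This is not jsoup's actual implementation: `DataNodeOutputSketch`, its `Syntax` enum, and `render` are hypothetical stand-ins. The point is that only the output syntax decides whether data is wrapped, so the escape mode is deliberately not a parameter.

```java
public class DataNodeOutputSketch {

    // Hypothetical stand-in for Document.OutputSettings.Syntax.
    public enum Syntax { html, xml }

    // Renders a data-bearing element (e.g. script or style).
    // Data is wrapped in CDATA only when the output syntax is XML;
    // in HTML syntax it is written out verbatim.
    public static String render(String tagName, String data, Syntax syntax) {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(tagName).append('>');
        if (syntax == Syntax.xml) {
            sb.append("<![CDATA[").append(data).append("]]>");
        } else {
            sb.append(data);
        }
        sb.append("</").append(tagName).append('>');
        return sb.toString();
    }

    public static void main(String[] args) {
        String data = "var html = \"This is '&nbsp;'\";";
        // prints <script>var html = "This is '&nbsp;'";</script>
        System.out.println(render("script", data, Syntax.html));
        // prints <script><![CDATA[var html = "This is '&nbsp;'";]]></script>
        System.out.println(render("script", data, Syntax.xml));
    }
}
```

Note that this sketch does not handle data that itself contains `]]>`, which would end the CDATA section early.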

@jhy jhy changed the title from "CDATA Data nodes when producing xhtml" to "Wrap Datanodes in CDATA when producing XML" Oct 24, 2023
@jhy jhy self-assigned this Oct 24, 2023
@jhy jhy added this to the 1.17.1 milestone Oct 24, 2023
@jhy
Owner

jhy commented Oct 24, 2023

(Back in the #202 days those tags were escaped as mentioned, but were not parsed as datanodes, hence the change)

@jhy jhy merged commit 1657e8f into jhy:master Oct 24, 2023
12 checks passed
nilsjorgen added a commit to navikt/spinnsyn-arkivering that referenced this pull request Feb 15, 2024
Had to update forventet.html because of this. Related to: jhy/jsoup#1720
nilsjorgen added a commit to navikt/spinnsyn-arkivering that referenced this pull request Feb 16, 2024
* Update README.md
* Bump jsoup to 1.17.2

Had to update forventet.html because of this. Related to: jhy/jsoup#1720