Incorrect parsing behaviour of inline script content: strings containing tag opening (/ closing?) characters #45

Closed
dannya opened this issue May 24, 2020 · 2 comments

dannya commented May 24, 2020

Take the following inline script block:

<script>
    var str = 'hey <form';

    if (!str.match(new RegExp('<(form|iframe)', 'g'))) {
        // ...
    }
</script>

... an array of 3 content strings is parsed from this script content, but I would expect this to be a single parsed content string, since any tag opening characters are within strings inside the inline script block.


Here is a complete minimal example:
(I'm using htmlnano, but I traced the behaviour to the posthtml-parser dependency of htmlnano)

const htmlnano = require('htmlnano');

htmlnano
  .process(
    `<!DOCTYPE html>
      <html>
        <head>
          <title>Test</title>
        </head>

        <body>
          <script>
            var str = 'hey <form';

            if (!str.match(new RegExp('<(form|iframe)', 'g'))) {
              // ...
            }
          </script>
        </body>
      </html>`,
    {
      custom: [
        (tree, options) => {
          tree.match({ tag: 'script' }, (node) => {
            // node is passed in via the tree parsed by posthtml-parser

            console.log(node.content);

            // console.log output:
            // [ '\n            var str = \'hey ',
            //   '<form\';\n\n            if (!str.match(new RegExp(\'',
            //   '<(form|iframe)\', \'g\'))) {\n              // ...\n            }\n          ' ]

            // an array of 3 content strings is parsed, but I would 
            // expect this to be a single parsed content string, 
            // since any tag opening characters are within strings 
            // inside the inline script block

            return node;
          });

          return tree;
        },
      ]
    },
  )
  .then((result) => {
    // ...
  });

(A similar kind of issue as seen in #18)
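
A possible workaround on the consumer side is to join the chunks back together before working with the script source. The following is only a minimal sketch: `joinTextContent` is a hypothetical helper, not part of posthtml-parser or htmlnano, and the sample chunks are shortened versions of the three strings logged above.

```javascript
// Hypothetical helper (not part of posthtml-parser or htmlnano): merge the
// string chunks of a node's content back into a single string.
function joinTextContent(content) {
  if (typeof content === 'string') return content;
  if (!Array.isArray(content)) return '';
  // A <script> body is raw text, so only string chunks are expected here;
  // anything else is skipped rather than serialized.
  return content.filter((chunk) => typeof chunk === 'string').join('');
}

// Shortened versions of the three chunks reported above.
const chunks = [
  "\nvar str = 'hey ",
  "<form';\nif (!str.match(new RegExp('",
  "<(form|iframe)', 'g'))) {}\n",
];

const joined = joinTextContent(chunks);
console.log(joined);
```

Inside the custom plugin, something like `node.content = [joinTextContent(node.content)]` could then restore a single content string. This assumes later plugins accept a one-element content array, which was not verified here.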

anikethsaha (Member) commented

It does look like a bug with posthtml-parser, as htmlparser2 seems to be working as expected here.

SukkaW commented Oct 14, 2020

> It does look like a bug with posthtml-parser, as htmlparser2 seems to be working as expected here.

No, htmlparser2 is working that way:

https://runkit.com/sukkaw/5f86d1ee0f32d6001a5d75c0

const script = `<script>
            var str = 'hey <form';

            if (!str.match(new RegExp('<(form|iframe)', 'g'))) {
                // ...
            }
        </script>`;

const htmlparser2 = require("htmlparser2");
const parser = new htmlparser2.Parser({
    onopentag(name, attribs) {
        if (name === "script" && attribs.type === "text/javascript") {
            console.log("JS! Hooray!");
        }
    },
    ontext(text) {
        console.log("-->", text);
    },
    onclosetag(tagname) {
        if (tagname === "script") {
            console.log("That's it?!");
        }
    },
});
parser.write(
    script
);
parser.end();

Output:

"-->"
"\n            var str = 'hey "
"-->"
"<form';\n\n            if (!str.match(new RegExp('"
"-->"
"<(form|iframe)', 'g'))) {\n                // ...\n            }\n        "
"That's it?!"
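
Because the output above shows several `ontext` events for one script body, a consumer can buffer text between the opening and closing `script` tags and join it afterwards. This is a rough sketch of that idea only, not the change made in the fixing commit. `createScriptCollector` is a hypothetical name, and the events are simulated by hand (with shortened text) so the snippet runs without htmlparser2 installed.

```javascript
// Sketch: buffer text events between <script> and </script>, then hand the
// joined source to a callback. Handler names match the ones used above.
function createScriptCollector(onScript) {
  let buffer = null; // null while outside a <script> element

  return {
    onopentag(name) {
      if (name === 'script') buffer = [];
    },
    ontext(text) {
      // Text outside a <script> element is ignored by this collector.
      if (buffer !== null) buffer.push(text);
    },
    onclosetag(name) {
      if (name === 'script' && buffer !== null) {
        onScript(buffer.join(''));
        buffer = null;
      }
    },
  };
}

// Simulate the event sequence from the output above, with shortened text.
const collected = [];
const handlers = createScriptCollector((source) => collected.push(source));

handlers.onopentag('script', {});
handlers.ontext("\n var str = 'hey ");
handlers.ontext("<form';\n if (!str.match(new RegExp('");
handlers.ontext("<(form|iframe)', 'g'))) {}\n");
handlers.onclosetag('script');

console.log(collected.length); // 1: the three chunks were merged into one entry
```

The same handler object could presumably be passed to `new htmlparser2.Parser(handlers)` in place of the inline callbacks, though that combination was not run here.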

@Scrum Scrum added this to the 0.5.1 milestone Oct 27, 2020
@Scrum Scrum closed this as completed in 8e64082 Oct 27, 2020