Skip to content
This repository has been archived by the owner on Apr 20, 2018. It is now read-only.

Parse HTML instead of using regex #244

Open
robwierzbowski opened this issue Nov 20, 2013 · 29 comments
Open

Parse HTML instead of using regex #244

robwierzbowski opened this issue Nov 20, 2013 · 29 comments
Assignees
Labels

Comments

@robwierzbowski
Copy link

As many issues on Usemin show, parsing HTML with regex is prone to failure.

Would you accept PRs to switch Usemin to an HTML5 parser, or is there a reason that the project prefers regexes?

@eddiemonge
Copy link
Member

im curious what that would look like

@robwierzbowski
Copy link
Author

@carols10cents ping.

@sleeper
Copy link
Contributor

sleeper commented Nov 21, 2013

Hi,

Sorry for the delay in answering.
Yes, I would accept PRs: this is something I want to do since a moment now, but being caught by other activities :(

@marcalj
Copy link

marcalj commented Dec 2, 2013

While this is not fixed you can use patterns option, but make sure that you use regexp with multiple instance behaviour (/g).

This should be advised in the README.md IMHO.

    usemin: {
      html: ['<%= yeoman.dist %>/{,*/}*.html'],
      css: ['<%= yeoman.dist %>/styles/{,*/}*.css'],
      js: '<%= yeoman.dist %>/scripts/*.js',
      options: {
        assetsDirs: ['<%= yeoman.dist %>', '<%= yeoman.dist %>/images'],
        patterns: {
          // FIXME While usemin won't have full support for revved files we have to put all references manually here
          js: [
              [/(customer-icon\.svg)/g, 'Replacing reference to customer-icon.png'],
              [/(iitail-logo\.svg)/g, 'Replacing reference to iitail-logo.png'],
              [/(no_image\.png)/g, 'Replacing reference to no_image.png'],
              [/(printer-iconx2\.png)/g, 'Replacing reference to printer-iconx2.png'],
              [/(email-iconx2\.png)/g, 'Replacing reference to email-iconx2.png'],
              [/(coins\.png)/g, 'Replacing reference to coins.png']
          ]
        }
      }
    },

Thanks! :)

@robwierzbowski
Copy link
Author

@eddiemonge Re, what this would look like: Instead of running regex on HTML we would use an HTML parser to build a document tree, and then would be able to read values with much more certainty. We could use a CSS parser like postCSS to do the same thing with CSS. JS and other templating languages would still need a patterns/regex matcher.

I'm not sure if this would speed up or slow down the task, but checking well defined object properties for a value, and if the value matches replacing would increase the reliability of parsing. A high percentage of the issues here are "Usemin misses this reference [because it's not in a place the regex expects]" or "Usemin chokes on this string", and I still regularly run into areas where Usemin chokes on some markup.

@kylecordes
Copy link

I think this would be a very good improvement. But it might be a big enough change that little of the original project is left. I wonder if it would be better done as an alternative project (which might take a while to mature) rather than as a swap-out-the-heart pull request on this one.

@lhwparis
Copy link

why an alternative project? i fully agree with @robwierzbowski this is the only future proof way to go for this project because regex causes so many problems and fails in variouse situations. parsing an html file is a core feature of usemin so this part should be absolute failsave and thats only possible using a full html parser and no regex.

@robwierzbowski
Copy link
Author

@carols10cents and I were thinking of a simplified, more declarative workflow, and wanted to start from scratch. If @eddiemonge / other maintainers want to discuss, I'm sure we'd be interested in prototyping in separate packages and seeing if we could merge in here eventually.

But just FYI, probably not going to start work on it till later in the spring.

@robwierzbowski
Copy link
Author

I'm moving my work in this space to gulp, so won't be contributing here. Leaving the issue open for anyone else that wants to pick up the torch.

@ralyodio
Copy link

cheerio is a module that gives you a jQuery API. Its much better suited for parsing html, than regex.

@SimplGy
Copy link

SimplGy commented Jul 30, 2014

We hit a failure because we commented out one of the script tags in a build block. Never expected it to still parse and include the script. This caused a stitch-min only error for us, which is a bummer to debug.

I understand why this happened now that I understand this works on regex, but it does seem like an HTML-based design would be more robust and operate more predictably.

@kylecordes
Copy link

Very helpful Stack Overflow page, explains whether HTML should be parsed using regex:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

@donaldpipowitch
Copy link
Contributor

Just wanted to say, that this "bug" can be a "feature", too. You can comment out a script blog which shouldn't be loaded for local development, but will be included by usemin for the optimized version.

@ralyodio
Copy link

why would want to include a comment out file?

@donaldpipowitch
Copy link
Contributor

Say you have something which becomes a js file ony in the deploy process, but is something else (like an HTML file) within development. This is mostly true for templates. Some task can add these compiled templates to an existing usemin config like https://github.com/ericclemmons/grunt-angular-templates does, but not all plugins have this option.

@eddiemonge
Copy link
Member

Then why not create a plugin that has the option?

@aendra-rininsland
Copy link
Contributor

I'm rather curious what the rationale for using regex to parse HTML was in the first place — like, hasn't everyone seen this StackOverflow answer?

Another reason for changing the implementation to parse the DOM instead of using regex is because JavaScript regex can't repeat capturing groups, which makes it incredibly difficult to support something like the srcset attribute, which can contain multiple series of filenames. See #428.

@donaldpipowitch
Copy link
Contributor

@Aendrew srcset is a serious problem with the current architecture, but as you can see here this project currently lacks a maintainer. There is currently no one capable of making such decisions and rewrites. I think #428 isn't the only tough problem. There is also the multiTarget problem (#255) or the problem if you want to rev a file which hasn't changed, but contains references/links with revved files, so you need to re-rev it because of changed URLs.
Besides that there are (imo) some API problems (like adding patterns without overwriting the default patterns). I think these problems could qualify a complete rewrite of grunt-usemin. (Just my opinion.)

@kylecordes
Copy link

My sense is that the time for usemin may be passing anyway, so finding another eager maintainer may be difficult and unnecessary. We have projects here still using grunt-usemin and gulp-usemin, but I believe a more modern approach is to automatically determine the list of files to be included using a clever file glob, a module system which understands files (like require) or add-on which uses the angular module system along with a module to file path naming convention to determine the full list of files to load.

@stephanebachelier
Copy link
Collaborator

@kylecordes I don't think it's just about determining a list of files to include. Usemin power IMHO is it's ability to replace and revved a whole repository full of links which is not a so easy task.
HTML parsing should help solving some issues but it's just a part of the whole system.

I will look into using an HTML parser, but no promise. There are lots of issues which I think most of them are users lost in usemin or going the wrong way.

@stevemao
Copy link
Contributor

cheerio or jsdom might be a good option. they both depend on htmlparser2

@stephanebachelier
Copy link
Collaborator

@stevemao thanks for the info. Will look onto it.

@stephanebachelier stephanebachelier self-assigned this Nov 15, 2014
@arthurvr
Copy link
Member

thanks for the info. Will look onto it.

@stephanebachelier What's your progress in this?

@stephanebachelier
Copy link
Collaborator

@arthurvr still working on it, will come back soon.

@stephanebachelier
Copy link
Collaborator

Just to let you know that I'm working on this as a top priority.

@lhwparis
Copy link

oh thats so great to hear thanks @stephanebachelier

@stephanebachelier
Copy link
Collaborator

To all the dev branch is a migration from regexps to an HTML Parser thanks to work from @marcelaraujo.
If anyone wants to try the dev branch and create an issue. I will review all the linked issues and add some tests cases to be sure.

@lvarayut
Copy link

@stephanebachelier Will the dev branch be merged soon? I'm using v3.1.1 and still having the exact same issue.

@stephanebachelier
Copy link
Collaborator

@lvarayut Not sure about the time. But I will migrate an existing customer project on the dev branch in the next two weeks so expect something to happen soon.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests