Parse HTML instead of using regex #244

robwierzbowski · 2013-11-20T20:00:13Z

As many issues on Usemin show, parsing HTML with regex is prone to failure.

Would you accept PRs to switch Usemin to an HTML5 parser, or is there a reason that the project prefers regexes?

eddiemonge · 2013-11-20T22:37:48Z

I'm curious what that would look like.

robwierzbowski · 2013-11-20T23:05:09Z

@carols10cents ping.

sleeper · 2013-11-21T05:28:32Z

Hi,

Sorry for the delay in answering.
Yes, I would accept PRs: this is something I have wanted to do for a while now, but I keep getting caught up in other activities :(

marcalj · 2013-12-02T16:35:23Z

Until this is fixed you can use the patterns option, but make sure your regexes use the global flag (/g) so that every occurrence is replaced, not just the first.

This should be documented in the README.md, IMHO.

    usemin: {
      html: ['<%= yeoman.dist %>/{,*/}*.html'],
      css: ['<%= yeoman.dist %>/styles/{,*/}*.css'],
      js: '<%= yeoman.dist %>/scripts/*.js',
      options: {
        assetsDirs: ['<%= yeoman.dist %>', '<%= yeoman.dist %>/images'],
        patterns: {
          // FIXME: Until usemin fully supports revved files, every reference has to be listed manually here
          js: [
            [/(customer-icon\.svg)/g, 'Replacing reference to customer-icon.svg'],
            [/(iitail-logo\.svg)/g, 'Replacing reference to iitail-logo.svg'],
            [/(no_image\.png)/g, 'Replacing reference to no_image.png'],
            [/(printer-iconx2\.png)/g, 'Replacing reference to printer-iconx2.png'],
            [/(email-iconx2\.png)/g, 'Replacing reference to email-iconx2.png'],
            [/(coins\.png)/g, 'Replacing reference to coins.png']
          ]
        }
      }
    },

Thanks! :)
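To illustrate why the /g flag matters, here is a minimal sketch using plain `String.prototype.replace`. The file contents and the revved file name are made up, and whether usemin applies its patterns in exactly this way is an assumption; the point is only how JavaScript treats a regex with and without the global flag.

```javascript
// Hypothetical file contents with two references to the same asset.
const source = 'load("coins.png"); preload("coins.png");';

// Hypothetical revved name; a real build would derive it from the files on disk.
const revved = 'coins.1a2b3c.png';

// Without /g only the first occurrence is replaced.
const withoutGlobal = source.replace(/(coins\.png)/, revved);

// With /g every occurrence is replaced.
const withGlobal = source.replace(/(coins\.png)/g, revved);

console.log(withoutGlobal); // load("coins.1a2b3c.png"); preload("coins.png");
console.log(withGlobal);    // load("coins.1a2b3c.png"); preload("coins.1a2b3c.png");
```

Without the flag, the second reference would keep pointing at the un-revved file.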

robwierzbowski · 2014-01-19T01:44:03Z

@eddiemonge Re: what this would look like. Instead of running regexes on HTML, we would use an HTML parser to build a document tree, and could then read values with much more certainty. We could use a CSS parser like PostCSS to do the same thing with CSS. JS and other templating languages would still need a patterns/regex matcher.

I'm not sure if this would speed up or slow down the task, but checking well-defined object properties for a value, and replacing the value when it matches, would make parsing more reliable. A high percentage of the issues here are "Usemin misses this reference [because it's not in a place the regex expects]" or "Usemin chokes on this string", and I still regularly run into markup that Usemin chokes on.
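As a rough sketch of "checking well-defined object properties", the snippet below walks a hand-written tree and rewrites only the attributes known to hold asset references. The tree shape, the `revved` map, and the `refAttrs` table are all hypothetical stand-ins for whatever a real HTML parser and revving step would produce; no parser is actually invoked here.

```javascript
// Hypothetical tree, shaped roughly like the output of an HTML parser.
const tree = {
  type: 'tag', name: 'body', attribs: {}, children: [
    { type: 'tag', name: 'img', attribs: { src: 'images/coins.png', alt: 'images/coins.png' }, children: [] },
    { type: 'comment', data: '<img src="images/coins.png">' },
    { type: 'tag', name: 'script', attribs: { src: 'scripts/app.js' }, children: [] }
  ]
};

// Hypothetical map from original paths to revved paths.
const revved = {
  'images/coins.png': 'images/coins.1a2b3c.png',
  'scripts/app.js': 'scripts/app.4d5e6f.js'
};

// Attributes that hold asset references, per tag name (an illustrative subset).
const refAttrs = { img: ['src'], script: ['src'], link: ['href'] };

function rewrite(node) {
  if (node.type !== 'tag') return; // comments and text nodes are left alone
  for (const attr of refAttrs[node.name] || []) {
    const value = node.attribs[attr];
    if (Object.prototype.hasOwnProperty.call(revved, value)) {
      node.attribs[attr] = revved[value];
    }
  }
  node.children.forEach(rewrite);
}

rewrite(tree);
console.log(JSON.stringify(tree, null, 2));
```

In this sketch the `alt` text and the commented-out markup contain the same path but are not touched, because only `src` on a tag node is treated as a reference.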

kylecordes · 2014-01-19T02:38:37Z

I think this would be a very good improvement. But it might be a big enough change that little of the original project is left. I wonder if it would be better done as an alternative project (which might take a while to mature) rather than as a swap-out-the-heart pull request on this one.

lhwparis · 2014-03-24T20:36:51Z

Why an alternative project? I fully agree with @robwierzbowski: this is the only future-proof way to go for this project, because regex causes so many problems and fails in various situations. Parsing an HTML file is a core feature of usemin, so this part should be absolutely fail-safe, and that's only possible with a full HTML parser, not regex.

robwierzbowski · 2014-03-24T20:48:24Z

@carols10cents and I were thinking of a simplified, more declarative workflow, and wanted to start from scratch. If @eddiemonge / other maintainers want to discuss, I'm sure we'd be interested in prototyping in separate packages and seeing if we could merge in here eventually.

But just FYI, probably not going to start work on it till later in the spring.

robwierzbowski · 2014-06-23T19:00:57Z

I'm moving my work in this space to gulp, so won't be contributing here. Leaving the issue open for anyone else that wants to pick up the torch.

ralyodio · 2014-06-25T17:08:58Z

cheerio is a module that gives you a jQuery-like API. It's much better suited to parsing HTML than regex.

SimplGy · 2014-07-30T17:42:52Z

We hit a failure because we commented out one of the script tags in a build block. Never expected it to still parse and include the script. This caused a stitch-min only error for us, which is a bummer to debug.

I understand why this happened now that I understand this works on regex, but it does seem like an HTML-based design would be more robust and operate more predictably.
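The failure mode above can be reproduced with a few lines of plain JavaScript. The pattern below is a simplified stand-in, not usemin's actual regex, and the build block is made up. The comment-stripping helper is only there to keep the sketch dependency-free; a real HTML parser would expose comments as separate nodes instead.

```javascript
// Hypothetical build block; the second script tag is commented out.
const block = [
  '<!-- build:js scripts/app.js -->',
  '<script src="scripts/a.js"></script>',
  '<!-- <script src="scripts/b.js"></script> -->',
  '<!-- endbuild -->'
].join('\n');

// Simplified pattern in the spirit of a regex-based extractor.
const scriptSrc = /<script[^>]+src=["']([^"']+)["']/g;

// Naive extraction: the regex cannot tell that a tag sits inside a comment.
function naiveSources(html) {
  return Array.from(html.matchAll(scriptSrc), (m) => m[1]);
}

// Comment-aware extraction: drop HTML comments first, then match.
function commentAwareSources(html) {
  const withoutComments = html.replace(/<!--[\s\S]*?-->/g, '');
  return Array.from(withoutComments.matchAll(scriptSrc), (m) => m[1]);
}

console.log(naiveSources(block));        // [ 'scripts/a.js', 'scripts/b.js' ]
console.log(commentAwareSources(block)); // [ 'scripts/a.js' ]
```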

kylecordes · 2014-07-30T20:32:51Z

Very helpful Stack Overflow page, explains whether HTML should be parsed using regex:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

donaldpipowitch · 2014-07-31T06:21:36Z

Just wanted to say that this "bug" can be a "feature", too. You can comment out a script block that shouldn't be loaded for local development but will still be included by usemin in the optimized version.

ralyodio · 2014-07-31T06:24:17Z

Why would you want to include a commented-out file?

donaldpipowitch · 2014-07-31T06:45:00Z

Say you have something that becomes a JS file only in the deploy process, but is something else (like an HTML file) during development. This is mostly true for templates. Some tasks can add these compiled templates to an existing usemin config, as https://github.com/ericclemmons/grunt-angular-templates does, but not all plugins have this option.

eddiemonge · 2014-07-31T16:29:03Z

Then why not create a plugin that has the option?

aendra-rininsland · 2014-10-20T16:26:45Z

I'm rather curious what the rationale for using regex to parse HTML was in the first place — like, hasn't everyone seen this StackOverflow answer?

Another reason for changing the implementation to parse the DOM instead of using regex is that a repeated capturing group in a JavaScript regex only keeps its last match, which makes it incredibly difficult to support something like the srcset attribute, which can contain several filenames. See #428.
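A small sketch of the srcset problem, using a made-up tag. The single-regex attempt shows that a repeated capturing group retains only its last repetition; the two-step helper `srcsetUrls` is a hypothetical workaround that isolates the attribute value first and then splits it. It assumes double-quoted attributes and URLs without commas, which real markup does not guarantee.

```javascript
// Hypothetical tag with three srcset candidates.
const tag = '<img srcset="images/a.png 1x, images/b.png 2x, images/c.png 3x">';

// One regex with a repeated group: only the last repetition is kept in group 1.
const repeated = /srcset="(?:\s*([^\s,"]+)[^,"]*,?)+"/;
const lastOnly = tag.match(repeated)[1];
console.log(lastOnly); // images/c.png

// Two-step alternative: grab the attribute value, then split it into candidates.
function srcsetUrls(html) {
  const attr = /srcset="([^"]*)"/.exec(html);
  if (!attr) return [];
  return attr[1]
    .split(',')
    .map((candidate) => candidate.trim().split(/\s+/)[0])
    .filter(Boolean);
}

console.log(srcsetUrls(tag)); // [ 'images/a.png', 'images/b.png', 'images/c.png' ]
```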

donaldpipowitch · 2014-10-21T06:14:26Z

@Aendrew srcset is a serious problem with the current architecture, but as you can see here, this project currently lacks a maintainer. There is currently no one in a position to make such decisions and rewrites. I think #428 isn't the only tough problem. There is also the multiTarget problem (#255), and the problem of revving a file that hasn't changed itself but contains references/links to revved files, so that it needs to be re-revved because of the changed URLs.
Besides that, there are (imo) some API problems (like adding patterns without overwriting the default patterns). I think these problems could justify a complete rewrite of grunt-usemin. (Just my opinion.)

kylecordes · 2014-10-21T14:27:38Z

My sense is that the time for usemin may be passing anyway, so finding another eager maintainer may be difficult and unnecessary. We have projects here still using grunt-usemin and gulp-usemin, but I believe a more modern approach is to determine the list of files to include automatically: with a clever file glob, with a module system that understands files (like require), or with an add-on that uses the Angular module system plus a module-to-file-path naming convention to work out the full list of files to load.

stephanebachelier · 2014-10-25T01:18:59Z

@kylecordes I don't think it's just about determining a list of files to include. Usemin's power, IMHO, is its ability to replace and rev a whole repository full of links, which is not an easy task.
HTML parsing should help solve some issues, but it's just one part of the whole system.

I will look into using an HTML parser, but no promises. There are lots of issues, and I think most of them come from users who are lost in usemin or going about it the wrong way.

stevemao · 2014-11-14T02:56:07Z

cheerio or jsdom might be a good option. They both depend on htmlparser2.

stephanebachelier · 2014-11-15T01:58:09Z

@stevemao thanks for the info. Will look into it.

arthurvr · 2015-02-19T16:56:14Z

> thanks for the info. Will look into it.

@stephanebachelier What's your progress in this?

stephanebachelier · 2015-02-19T18:33:15Z

@arthurvr still working on it, will come back soon.

stephanebachelier · 2015-03-11T23:40:55Z

Just to let you know that I'm working on this as a top priority.

lhwparis · 2015-03-11T23:47:43Z

Oh, that's so great to hear, thanks @stephanebachelier

stephanebachelier · 2015-04-13T15:50:40Z

To all: the dev branch is a migration from regexps to an HTML parser, thanks to work from @marcelaraujo.
Anyone who wants to is welcome to try the dev branch and create an issue. I will review all the linked issues and add some test cases to be sure.

lvarayut · 2015-12-24T13:34:18Z

@stephanebachelier Will the dev branch be merged soon? I'm using v3.1.1 and still having the exact same issue.

stephanebachelier · 2015-12-24T13:41:39Z

@lvarayut Not sure about the timing. But I will migrate an existing customer project to the dev branch in the next two weeks, so expect something to happen soon.

marcalj mentioned this issue Dec 7, 2013

"usemin:css" doesn't generate correct image paths for relative urls yeoman/yeoman#824

Closed

marcalj mentioned this issue Dec 16, 2013

adding grunt-angular-templates to the build tasks yeoman/generator-angular#277

Closed

marcalj mentioned this issue Jan 8, 2014

usemin not replacing image names in javascript files #235

Closed

robwierzbowski mentioned this issue Jan 19, 2014

html comments inside build block #144

Closed

carols10cents mentioned this issue Feb 2, 2014

Use an HTML parser to extract blocks #285

Closed

donaldpipowitch mentioned this issue Sep 5, 2014

include script only in build #433

Closed

aendra-rininsland mentioned this issue Oct 20, 2014

Task only replacing image references in <img> tags #428

Closed

stephanebachelier self-assigned this Nov 15, 2014

This was referenced Feb 19, 2015

should not process comment scripts? #503

Open

grunt usemin concat js file and image file together #513

Closed

stephanebachelier added the feature label Feb 22, 2015

This was referenced Feb 23, 2015

Insufficient linefeed detection #88

Open

collapseWhitespace in htmlmin task makes usemin destroy the html #44

Open

Add cascading alternate search paths to fix preprocessor workflow #133

Open

stephanebachelier mentioned this issue Mar 11, 2015

Linefeed issue on files with mixed types generates empty html #416

Open

This was referenced Mar 11, 2015

Data attributes containing angular "hidden" fields (e.g. data-smth="obj.$$field") are mistakenly stripped of the second $ #480

Open

Why replace data-* attributes in tags? #441

Open

sindresorhus mentioned this issue Mar 31, 2016

Help with maintainance #313

Open
