Add heuristic checking for HTML anchors #1716

phillord · 2022-01-03T12:14:36Z

As a starter for discussion, not finished yet.

Previously only anchors specified or generated in markdown could be
linked to, without complaint from the link checker. We now use a
simple heuristic check for name or id attributes.

Addresses #1707


biodranik · 2022-01-04T20:28:18Z

components/utils/src/links.rs

@@ -0,0 +1,16 @@
+pub fn anchor_id_checks(anchor:&str) -> Vec<String> {
+    vec![
+        format!(" id={}", anchor),


Will regex be slower or faster here?

Not sure, this is a lot of allocation, so I'm guessing the regex would be faster? But it's hard to say without a benchmark.

Possibly faster and probably more maintainable. I just copied what was already there, but happy to refactor it out to use a regexp. This code only runs if other anchor checks fail, so any performance hit will only be on sites that are broken or have these anchors in.
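For readers following along, the direct string matching approach discussed here can be sketched roughly as below. This is only an illustration: the diff above shows just the first candidate string, so the remaining variants, and the helper name `anchor_id_candidates`, are assumptions rather than the PR's actual list.

```rust
/// Hypothetical helper: builds the literal attribute strings to search for.
/// Only the first variant appears in the diff; the rest are assumed.
pub fn anchor_id_candidates(anchor: &str) -> Vec<String> {
    vec![
        format!(" id={}", anchor),
        format!(" id=\"{}\"", anchor),
        format!(" id='{}'", anchor),
        format!(" name={}", anchor),
        format!(" name=\"{}\"", anchor),
        format!(" name='{}'", anchor),
    ]
}

/// Returns true if any candidate string occurs in the rendered content.
pub fn has_anchor_id(content: &str, anchor: &str) -> bool {
    anchor_id_candidates(anchor)
        .iter()
        .any(|candidate| content.contains(candidate.as_str()))
}
```

Each call allocates one `String` per variant, which is the allocation cost mentioned above. Note also that the unquoted variants are plain prefix matches, so ` id=fred` would also be found inside ` id=freddy`.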

The regex crate is already a dependency, so there's nothing wrong with trying. I'm guessing we'd need a few hundred links to make a benchmark informative.

Let's go with the regex approach then.

Okay, pushed. Performance testing results.

I compiled a site with two small files, one of which contains ~2k links to an XML anchor. I then tested various options, in all cases using cargo run but taking the time reported at the end.
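The numbers in this comment are end-to-end build times. If someone wanted to time the check in isolation instead, a minimal sketch might look like the following. The function `time_checks` and the simplified `has_anchor_id` are hypothetical stand-ins, not Zola's code, and only double-quoted `id` attributes are considered.

```rust
use std::time::{Duration, Instant};

/// Simplified stand-in for the heuristic check (double-quoted `id` only).
fn has_anchor_id(content: &str, anchor: &str) -> bool {
    content.contains(format!(" id=\"{}\"", anchor).as_str())
}

/// Runs the check once per link and reports how many links resolved
/// and how long the whole loop took.
pub fn time_checks(content: &str, anchor: &str, links: usize) -> (usize, Duration) {
    let started = Instant::now();
    let resolved = (0..links)
        .filter(|_| has_anchor_id(content, anchor))
        .count();
    (resolved, started.elapsed())
}
```

As with the measurements in this thread, a sketch like this would need to be run both with and without `--release`, since the two can differ by an order of magnitude.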

Comparing regexp to direct string matching

regexp, --release: 1.3s
regexp, no --release: 39s
non-regexp, --release: 1.0s
non-regexp, no --release: 19.4s

Conclusion: Regexp is slower, especially for non-optimized builds.

Then comparing the non-regexp implementation with heuristic_link_check false or true

heuristic_link_check = false, --release: 0.5s
heuristic_link_check = false, no --release: 1.1s

Conclusion: The heuristic link check causes a 100% increase in build times for release builds (1.0s vs 0.5s) and a much bigger increase for non-optimized builds (19.4s vs 1.1s).

However, in this case zola produces lots of error messages, and printouts can be slow. So I also tried fixing the broken links by adding a markdown-generated anchor (with the same ~2k links to it), and compared this situation with and without the heuristic link checker.

--release, no failures: 0.5s
no --release, no failures: 1.1s
heuristic_link_check = false, --release, no failures: 0.5s
heuristic_link_check = false, no --release, no failures: 1.1s

Conclusion 1: printing errors doesn't make much of a difference to the performance.
Conclusion 2: if the heuristic link checker is on, it does not make that much difference if most links are not broken. This would be predicted from the code.

So, I would say the regexp code is a bit neater and the multiple format! checking is a bit faster, so either implementation is fine. I'd probably go with the regexp because neat trumps fast for me, but I'm happy with whatever you prefer.

And while the heuristic link checker significantly slows things down, you only pay for it if you are using XML anchors. As zola currently refuses to build a site under these circumstances, it seems a reasonable price to pay.

Keats · 2022-01-05T21:39:34Z

components/utils/src/links.rs

+
+pub fn anchor_id_checks(anchor:&str) -> Regex {
+    Regex::new(
+        &format!(r#" (id|ID|name|NAME) *= *("|'){}("|')"#, anchor)


You're missing the cases where the attribute value has no quotes.

I would also make regex case-insensitive to handle Name, iD, etc.

Now done, hopefully without any obscure side effects.
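The matching rules requested in this review (attribute name compared case-insensitively, optional spaces around `=`, value in double quotes, single quotes, or unquoted) can be illustrated without the regex crate. The sketch below is a dependency-free scanner, not the regex the PR uses; `has_anchor_attr` and its helpers are hypothetical names.

```rust
/// Returns true if `content` has an `id` or `name` attribute equal to `anchor`.
pub fn has_anchor_attr(content: &str, anchor: &str) -> bool {
    let bytes = content.as_bytes();
    for name in ["id", "name"] {
        let n = name.len();
        let mut i = 1;
        while i + n <= bytes.len() {
            // The attribute name must follow whitespace and match ignoring ASCII case.
            if bytes[i - 1].is_ascii_whitespace()
                && bytes[i..i + n].eq_ignore_ascii_case(name.as_bytes())
                && value_matches(&bytes[i + n..], anchor.as_bytes())
            {
                return true;
            }
            i += 1;
        }
    }
    false
}

/// Expects optional spaces, `=`, optional spaces, then a quoted or bare value.
fn value_matches(rest: &[u8], anchor: &[u8]) -> bool {
    let mut pos = skip_spaces(rest, 0);
    if rest.get(pos) != Some(&b'=') {
        return false;
    }
    pos = skip_spaces(rest, pos + 1);
    let value = match rest.get(pos) {
        // Quoted value: runs up to the matching closing quote.
        Some(&q) if q == b'"' || q == b'\'' => {
            let start = pos + 1;
            match rest[start..].iter().position(|&b| b == q) {
                Some(len) => &rest[start..start + len],
                None => return false,
            }
        }
        // Bare value: runs up to whitespace or the end of the tag.
        Some(_) => {
            let len = rest[pos..]
                .iter()
                .position(|&b| b.is_ascii_whitespace() || b == b'>')
                .unwrap_or(rest.len() - pos);
            &rest[pos..pos + len]
        }
        None => return false,
    };
    value == anchor
}

fn skip_spaces(bytes: &[u8], mut pos: usize) -> usize {
    while bytes.get(pos) == Some(&b' ') {
        pos += 1;
    }
    pos
}
```

Because the value is compared as a whole, `id=freddy` does not match the anchor `fred`. A scanner like this also avoids interpolating the anchor text into a pattern; with the regex crate, `regex::escape` would be the usual way to handle anchors containing metacharacters.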

Keats · 2022-01-05T21:45:59Z

docs/content/documentation/getting-started/configuration.md

@@ -130,6 +130,9 @@ skip_anchor_prefixes = [
    "https://caniuse.com/",
 ]

+# Check for links to anchors defined in HTML rather than markdown
+heuristic_link_check = true


In the end, since it's only run on local failures, it's probably fine to not need an option.

Your call, but it would give the option of disabling something that can produce failures. Switching the default around might be easier?

There's already the workaround of creating the link in HTML directly

biodranik · 2022-01-06T06:54:37Z

components/site/src/link_checking.rs

+
+            !(page.has_anchor(anchor)||
+              site.config.link_checker.heuristic_link_check &&
+              page.has_anchor_id(anchor))


Is the regex created for each page? Would caching it make processing faster?

For each anchor and each page. I would guess that it would only make a minor difference in most cases.
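The reason the cost is limited is visible in the condition in the diff: `||` short-circuits, so the heuristic scan only runs when the markdown anchor lookup has already failed. A small sketch of that behaviour follows; the `Page` struct and the call counter are hypothetical stand-ins for illustration, not Zola's types.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a rendered page; not Zola's actual type.
pub struct Page {
    pub heading_ids: Vec<String>,
    pub content: String,
    /// Counts how often the fallback scan runs, for illustration only.
    pub heuristic_calls: Cell<usize>,
}

impl Page {
    /// Cheap lookup among anchors already known from markdown.
    pub fn has_anchor(&self, anchor: &str) -> bool {
        self.heading_ids.iter().any(|id| id == anchor)
    }

    /// Fallback scan of the content (simplified: double-quoted `id` only).
    pub fn has_anchor_id(&self, anchor: &str) -> bool {
        self.heuristic_calls.set(self.heuristic_calls.get() + 1);
        self.content
            .contains(format!(" id=\"{}\"", anchor).as_str())
    }
}

/// Mirrors the shape of the condition in the diff above.
pub fn is_broken(page: &Page, anchor: &str, heuristic_link_check: bool) -> bool {
    !(page.has_anchor(anchor) || (heuristic_link_check && page.has_anchor_id(anchor)))
}
```

With this shape, a site whose links all resolve through markdown anchors never reaches the fallback, and setting the option to false skips it entirely.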

I think caching the regex wouldn't be as useful because the regex pattern contains the text of each anchor, so the likelihood any particular regex would be needed again is slim.

Unless the method was flipped around, so there was a single cached regex behind a mutex, with a pattern like `r#"(?i) (id|name) *= *"([^"]*)""#` (case-insensitively matching all name or id attributes; I used only double quoting for simplicity). Then there's only one regex, which can find all matches in a given page, and you can then check those matches for the anchor in question.

I think there's some chance that would be faster than creating so many fresh regexes.
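To make the flipped approach concrete: the idea is to extract every `id`/`name` value from a page once and then answer each anchor query with a set lookup. The sketch below uses plain string scanning instead of a cached regex so that it has no dependencies, and like the pattern above it handles only double-quoted values with no spaces around `=`. The name `collect_anchor_ids` is hypothetical.

```rust
use std::collections::HashSet;

/// Collects every double-quoted `id="..."` / `name="..."` value on a page.
/// The page is scanned a fixed number of times, however many anchors are checked later.
pub fn collect_anchor_ids(content: &str) -> HashSet<String> {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = content.to_ascii_lowercase();
    let mut ids = HashSet::new();
    for attr in [" id=\"", " name=\""] {
        let mut from = 0;
        while let Some(found) = lower[from..].find(attr) {
            let start = from + found + attr.len();
            match content[start..].find('"') {
                Some(len) => {
                    // Keep the value with its original casing.
                    ids.insert(content[start..start + len].to_string());
                    from = start + len + 1;
                }
                // Unterminated attribute value: stop scanning for this attribute.
                None => break,
            }
        }
    }
    ids
}
```

Each anchor check then becomes `ids.contains(anchor)`. Whether this is actually faster than building a pattern per anchor would still need measuring.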

@mwcz I would agree that this might be faster. Then again, I thought regexes would be faster than exact string matching and was wrong. We don't have a good, realistic test case for any kind of real-world benchmarking. My natural inclination would be to leave the simplest implementation until it's clear that this is a problem.

For sure, +1 to keeping it simple at first.

For benchmarking, I might be able to help with that. The site I'm starting to build with zola has 130,000 pages of markdown, with a lot of name/id anchors, so it would be highly sensitive to any performance issues. I can run benchmarks myself with that content. Maybe if I can find a way to scrub the content (lorem-ipsumify it but keep the markdown layout) then I could share it.

@mwcz That would be a good benchmark, but only to see whether this code affects it at all. This code only runs if the current link checker fails. So if your current site builds (i.e. all the links work), then this code should never run. That's a useful thing to check, but it doesn't give any indication of the performance of this code. I'm assuming your anchors here are either zola-generated section slugs or manually specified anchors.

What you could do is take half the markdown files and turn them into header/footer-less HTML with pandoc or equivalent. Now you should have 65,000 markdown files with pointers to 65,000 HTML files, so half of your outlinks would be to XML-specified anchors and therefore use this code.

Quite how we would interpret the results I don't know. This check is surely slower than checking for markdown anchors which have to be parsed out anyway. So, it's not clear what would be an acceptable slowdown.

The fastest implementation would be integrated into the markdown=>HTML convertor, where it's easy to collect all ids/links at the conversion stage and then to finally check at the end.

@biodranik Who knows what would be fastest, without a good test. Collecting all the ids/links would potentially mean checking a lot more links, so it might be slower. Of course, it does mean that the link checker could check all links. At the moment HTML -> HTML links that are broken would not be picked up.

It's a question of what the link checker is there for. I think it's fine that it is limited to markdown links, so long as it does not preclude me putting in HTML anchors.

Previously only anchors specified or generated in markdown could be linked to, without complaint from the link checker. We now use a simple heuristic check for `name` or `id` attributes. Duplicate code has been refactored and all XML anchor checks updated to use regex rather than substring match.

phillord · 2022-01-08T14:12:57Z

Think this is ready to go now!

Keats · 2022-01-08T23:14:44Z

components/utils/src/links.rs

+
+pub fn anchor_id_checks(anchor:&str) -> Regex {
+    Regex::new(
+        &format!(r#" (?i)(id|ID|name|NAME) *= *("|')*{}("|'| |>)+"#, anchor)


Since we're setting the ?i flag, we can get rid of the multiple cases in the first part of the regex

Keats · 2022-01-08T23:14:56Z

components/library/src/content/mod.rs

@@ -28,6 +29,12 @@ pub fn has_anchor(headings: &[Heading], anchor: &str) -> bool {
    false
 }

+
+pub fn has_anchor_id(content: &str, anchor: &str) -> bool {


I would probably move this function to utils directly since we will only ever do is_match

Keats · 2022-01-09T21:04:48Z

That looks ok to me, anyone else having comments?

biodranik · 2022-01-09T21:09:47Z

components/utils/Cargo.toml

@@ -9,6 +9,7 @@ include = ["src/**/*"]
 tera = "1"
 unicode-segmentation = "1.2"
 walkdir = "2"
+regex="1"


biodranik · 2022-01-09T21:15:38Z

components/utils/src/links.rs

+
+        // Case variants
+        assert!(m(r#"<a ID="fred">"#));
+        assert!(m(r#"<a iD="fred">"#));


Maybe add some other tags, so readers can easily see that the code also works with any element that has an id?

mwcz · 2022-01-10T02:06:08Z

> That looks ok to me, anyone else having comments?

Looks good to me. It works well in my two trivial test cases and in @mwcz's Giant Site, which had 493 broken anchors in 0.15.2 and with this patch is down to 209 (which seem to be real). And there was no noticeable performance difference at all.

* Add heuristic checking for HTML anchors

  Previously only anchors specified or generated in markdown could be linked to, without complaint from the link checker. We now use a simple heuristic check for `name` or `id` attributes. Duplicate code has been refactored and all XML anchor checks updated to use regex rather than substring match.

* Fix regexp and refactor

biodranik reviewed Jan 4, 2022

Keats reviewed Jan 5, 2022

biodranik reviewed Jan 6, 2022

phillord changed the base branch from master to next January 8, 2022 13:26

phillord force-pushed the feature/xml-anchor-checking branch 2 times, most recently from 15c2b00 to 88171a4 January 8, 2022 13:36

phillord force-pushed the feature/xml-anchor-checking branch from 88171a4 to a473333 January 8, 2022 13:47

Keats requested changes Jan 8, 2022

Fix regexp and refactor (e538e4a)

biodranik approved these changes Jan 9, 2022

Keats merged commit 2f699f6 into getzola:next Jan 10, 2022
