Add heuristic checking for HTML anchors #1716

Merged 2 commits on Jan 10, 2022

@phillord phillord commented Jan 3, 2022

As a starter for discussion, not finished yet.

Previously only anchors specified or generated in markdown could be
linked to, without complaint from the link checker. We now use a
simple heuristic check for name or id attributes.

Addresses #1707

IMPORTANT: Please do not create a Pull Request adding a new feature without discussing it first.

The place to discuss new features is the forum: https://zola.discourse.group/
If you want to add a new feature, please open a thread there first in the feature requests section.

Sanity check:

  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

Code changes

(Delete or ignore this section for documentation changes)

  • Are you doing the PR on the next branch?

If the change is a new feature or adding to/changing an existing one:

  • Have you created/updated the relevant documentation page(s)?

@@ -0,0 +1,16 @@
pub fn anchor_id_checks(anchor:&str) -> Vec<String> {
vec![
format!(" id={}", anchor),
Contributor

Will regex be slower or faster here?

Collaborator

Not sure. This is a lot of allocation, so I'm guessing the regex would be faster, but it's hard to say without a benchmark.
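
For context, the substring approach under discussion can be sketched roughly as follows. This is an illustration only: the names `candidate_patterns` and `has_anchor_substring` and the exact list of variants are assumptions, since the diff above is truncated. It mainly shows where the allocation comes from: every call builds a fresh `Vec<String>` before any matching happens.

```rust
/// Build the literal strings to search for (hypothetical variant list).
fn candidate_patterns(anchor: &str) -> Vec<String> {
    let mut patterns = Vec::new();
    for attr in ["id", "ID", "name", "NAME"] {
        // Unquoted, double-quoted, and single-quoted forms.
        patterns.push(format!(" {}={}", attr, anchor));
        patterns.push(format!(" {}=\"{}\"", attr, anchor));
        patterns.push(format!(" {}='{}'", attr, anchor));
    }
    patterns
}

/// True if any candidate string occurs verbatim in the page content.
fn has_anchor_substring(content: &str, anchor: &str) -> bool {
    candidate_patterns(anchor)
        .iter()
        .any(|p| content.contains(p.as_str()))
}
```

With a fixed list like this, mixed-case spellings such as `iD` or spaces around `=` are not matched, which is part of why a regex looks attractive.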

Contributor Author

Possibly faster and probably more maintainable. I just copied what was already there, but happy to refactor it out to use a regexp. This code only runs if other anchor checks fail, so any performance hit will only be on sites that are broken or have these anchors in.

The regex crate is already a dependency, so nothing wrong with trying. I'm guessing we'd need a few hundred links to make a benchmark informative.

Collaborator

let's go with the regex approach then

Contributor Author

Okay, pushed. Performance testing results.

I compiled a site with two small files, one of which contains ~2k links to an XML anchor. I then tested various options, in all cases using cargo run but taking the time reported at the end.

Comparing regexp to direct string matching

regexp:
   --release: 1.3s
   no --release: 39s

non-regexp:
   --release: 1.0s
   no --release: 19.4s

Conclusion: Regexp is slower esp for non-optimized builds.

Then comparing the non-regexp implementation with heuristic_link_check false or true

heuristic_link_check = false --release   0.5s 
heuristic_link_check = false no --release 1.1s

Conclusion: The heuristic link check causes a 100% increase in build times (for a release build) and a much bigger increase for non-optimized builds.

However, in this case zola produces lots of error messages, and printouts can be slow. So I also tried fixing the broken links by adding a markdown-generated anchor (with the same ~2k links to it), and compared this situation with and without the heuristic link checker.

--release, no failures: 0.5s
no --release, no failures: 1.1s

heuristic_link_check = false --release no failures 0.5s
heuristic_link_check = false no --release no failures 1.1s

Conclusion 1: printing errors doesn't make much of a difference to the performance.
Conclusion 2: if the heuristic link checker is on, it does not make that much difference if most links are not broken. This would be predicted from the code.

So, I would say the regexp code is a bit neater and the multiple format! checking is a bit faster, so either implementation is fine. I'd probably go with the regexp because neat trumps fast for me, but I'm happy with whatever you prefer.

And while heuristic_link_check significantly slows things down, you only pay for it if you are using XML anchors. As zola currently refuses to build a site under these circumstances, it seems a reasonable price to pay.


pub fn anchor_id_checks(anchor:&str) -> Regex {
Regex::new(
&format!(r#" (id|ID|name|NAME) *= *("|'){}("|')"#, anchor)
Collaborator

you're missing the ones where they have no quotes

Contributor

I would also make the regex case-insensitive to handle Name, iD, etc.

Contributor Author

Now done, hopefully without any obscure side effects.

@@ -130,6 +130,9 @@ skip_anchor_prefixes = [
"https://caniuse.com/",
]

# Check for links to anchors defined in HTML rather than markdown
heuristic_link_check = true
Collaborator

In the end, since it's only run on local failures, it's probably fine not to have an option.

Contributor Author

Your call, but it would give the option of disabling something that can produce failures. Switching the default around might be easier?

Collaborator

There's already the workaround of creating the link in HTML directly


!(page.has_anchor(anchor)||
site.config.link_checker.heuristic_link_check &&
page.has_anchor_id(anchor))
Contributor

Is the regex created for each page? Would caching it make processing faster?

Contributor Author

For each anchor and each page. I would guess that it would only make a minor difference in most cases.
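
The condition in the diff relies on short-circuit evaluation, which is why the heuristic only costs anything when the markdown anchor check has already failed. A minimal sketch of that behaviour follows. The `Page` struct, its fields, and the call counter are hypothetical stand-ins rather than Zola's real types, and the heuristic itself is reduced to a single substring test.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a rendered page; not Zola's real type.
struct Page {
    markdown_anchors: Vec<String>,
    content: String,
    /// Counts how often the heuristic actually ran (for illustration).
    heuristic_calls: Cell<usize>,
}

impl Page {
    /// Cheap check against anchors already collected from markdown.
    fn has_anchor(&self, anchor: &str) -> bool {
        self.markdown_anchors.iter().any(|a| a == anchor)
    }

    /// Costlier heuristic scan of the raw content (simplified here).
    fn has_anchor_id(&self, anchor: &str) -> bool {
        self.heuristic_calls.set(self.heuristic_calls.get() + 1);
        self.content.contains(&format!(" id=\"{}\"", anchor))
    }
}

/// A link is broken only if both checks fail. `||` and `&&` short-circuit,
/// so the heuristic is skipped when the markdown check succeeds or the
/// option is switched off.
fn is_broken(page: &Page, anchor: &str, heuristic_enabled: bool) -> bool {
    !(page.has_anchor(anchor) || (heuristic_enabled && page.has_anchor_id(anchor)))
}
```

On a site where every link resolves through markdown anchors, the counter in this sketch would stay at zero, which matches the timing results reported earlier in the thread.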

Contributor

I think caching the regex wouldn't be as useful because the regex pattern contains the text of each anchor, so the likelihood any particular regex would be needed again is slim.

Unless the method was flipped around, so that there was a single cached regex behind a mutex, with a pattern like r#"(?i) (id|name) *= *"[^"]*"?"# (case-insensitively matching all name or id attributes; I used only double quoting for simplicity). Then there's only one regex, which can find all matches in a given page, and you can check those matches for the anchor in question.

I think there's some chance that would be faster than creating so many fresh regexes.
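
The flipped-around idea can be sketched without any per-anchor pattern: scan the page once, collect every `id`/`name` value, then answer each anchor query with a set lookup. To keep the sketch dependency-free, a hand-written scan is swapped in for the single regex suggested above. The function names are hypothetical, and this is a heuristic text scan, not an HTML parser.

```rust
use std::collections::HashSet;

/// ASCII case-insensitive prefix test, so `ID`, `iD`, and `Name` all match.
fn starts_with_ignore_case(s: &str, prefix: &str) -> bool {
    s.len() >= prefix.len()
        && s.as_bytes()[..prefix.len()].eq_ignore_ascii_case(prefix.as_bytes())
}

/// Collect every `id`/`name` attribute value in `content` in a single pass.
fn collect_anchor_ids(content: &str) -> HashSet<String> {
    let mut ids = HashSet::new();
    let bytes = content.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        let is_space = bytes[i].is_ascii_whitespace();
        i += 1;
        if !is_space {
            continue;
        }
        // `i` is just past an ASCII whitespace byte, a valid char boundary.
        let rest = &content[i..];
        let name_len = if starts_with_ignore_case(rest, "id") {
            2
        } else if starts_with_ignore_case(rest, "name") {
            4
        } else {
            continue;
        };
        // Allow optional spaces around `=`.
        let after_name = rest[name_len..].trim_start_matches(' ');
        let after_eq = match after_name.strip_prefix('=') {
            Some(s) => s.trim_start_matches(' '),
            None => continue,
        };
        let value = match after_eq.chars().next() {
            // Quoted value: read up to the matching quote.
            Some(q) if q == '"' || q == '\'' => match after_eq[1..].find(q) {
                Some(end) => &after_eq[1..1 + end],
                None => continue,
            },
            // Unquoted value: read up to whitespace or the end of the tag.
            _ => {
                let end = after_eq
                    .find(|c: char| c.is_whitespace() || c == '>')
                    .unwrap_or(after_eq.len());
                &after_eq[..end]
            }
        };
        if !value.is_empty() {
            ids.insert(value.to_string());
        }
    }
    ids
}
```

The resulting set could be cached per page, so that checking many anchors against the same page costs one scan plus cheap lookups. Whether that beats building a fresh regex per anchor would still need a benchmark.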

Contributor Author

@mwcz I would agree that this might be faster. Then again, I thought regexes would be faster than exact string matching and was wrong. We don't have a good, realistic test case for any kind of real-world benchmarking. My natural inclination would be to leave the simplest implementation until it's clear that this is a problem.

Contributor

For sure, +1 to keeping it simple at first.

For benchmarking, I might be able to help with that. The site I'm starting to build with zola has 130,000 pages of markdown, with a lot of name/id anchors, so it would be highly sensitive to any performance issues. I can run benchmarks myself with that content. Maybe if I can find a way to scrub the content (lorem-ipsumify it but keep the markdown layout) then I could share it.

Contributor Author

@mwcz That would be a good benchmark, but only to see whether this code affects it at all. This code only runs if the current link checker fails. So if your current site builds (i.e. all the links work), then this code should never run. That's a useful thing to check, but doesn't give any indication of the performance of this code. I'm assuming your anchors here are either zola-generated section slugs or manually specified anchors.

What you could do is take half the markdown files and turn them into header/footer-less HTML with pandoc or equivalent. Now you should have 65,000 markdown files with pointers to 65,000 HTML files, so half of your outlinks would be to XML-specified anchors and would therefore use this code.

Quite how we would interpret the results I don't know. This check is surely slower than checking for markdown anchors which have to be parsed out anyway. So, it's not clear what would be an acceptable slowdown.

Contributor

@biodranik biodranik Jan 6, 2022

The fastest implementation would be integrated into the markdown=>HTML converter, where it's easy to collect all ids/links at the conversion stage and then check them all at the end.

Contributor Author

@biodranik Who knows what would be fastest, without a good test. Collecting all the ids/links would potentially mean checking a lot more links, so it might be slower. Of course, it does mean that the link checker could check all links. At the moment HTML -> HTML links that are broken would not be picked up.

It's a question of what the link checker is there for. I think it's fine that it is limited to markdown links, so long as it does not preclude me putting in HTML anchors.
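
The collect-then-check alternative raised in this exchange might look roughly like the sketch below. `AnchorIndex`, `Link`, and the method names are hypothetical and not part of Zola. The assumption is that the converter calls `record_id` and `record_link` as it renders each page, and that `broken_links` runs once after all pages are done.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical record of one internal link found during rendering.
struct Link {
    source: String,
    target: String,
    anchor: String,
}

/// Accumulates anchors and links while pages are converted.
#[derive(Default)]
struct AnchorIndex {
    ids_by_page: HashMap<String, HashSet<String>>,
    links: Vec<Link>,
}

impl AnchorIndex {
    /// Called by the converter whenever it emits an element with an id.
    fn record_id(&mut self, page: &str, id: &str) {
        self.ids_by_page
            .entry(page.to_string())
            .or_default()
            .insert(id.to_string());
    }

    /// Called by the converter whenever it emits a link with a fragment.
    fn record_link(&mut self, source: &str, target: &str, anchor: &str) {
        self.links.push(Link {
            source: source.to_string(),
            target: target.to_string(),
            anchor: anchor.to_string(),
        });
    }

    /// Run once after every page has been converted.
    fn broken_links(&self) -> Vec<&Link> {
        self.links
            .iter()
            .filter(|l| {
                !self
                    .ids_by_page
                    .get(&l.target)
                    .map_or(false, |ids| ids.contains(&l.anchor))
            })
            .collect()
    }
}
```

As noted in the reply above, this design checks every recorded link rather than only the ones that failed the markdown check, so whether it is faster in practice is an open question.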

@phillord phillord changed the base branch from master to next January 8, 2022 13:26
@phillord phillord force-pushed the feature/xml-anchor-checking branch 2 times, most recently from 15c2b00 to 88171a4 Compare January 8, 2022 13:36
Previously only anchors specified or generated in markdown could be
linked to, without complaint from the link checker. We now use a
simple heuristic check for `name` or `id` attributes.

Duplicate code has been refactored and all XML anchor checks updated
to use regex rather than substring match.
Contributor Author

phillord commented Jan 8, 2022

Think this is ready to go now!


pub fn anchor_id_checks(anchor:&str) -> Regex {
Regex::new(
&format!(r#" (?i)(id|ID|name|NAME) *= *("|')*{}("|'| |>)+"#, anchor)
Collaborator

Since we're setting the ?i flag, we can get rid of the multiple cases in the first part of the regex

@@ -28,6 +29,12 @@ pub fn has_anchor(headings: &[Heading], anchor: &str) -> bool {
false
}


pub fn has_anchor_id(content: &str, anchor: &str) -> bool {
Collaborator

I would probably move this function to utils directly since we will only ever do is_match

Collaborator

Keats commented Jan 9, 2022

That looks ok to me, anyone else having comments?

@@ -9,6 +9,7 @@ include = ["src/**/*"]
tera = "1"
unicode-segmentation = "1.2"
walkdir = "2"
regex="1"
Contributor

nit: space


// Case variants
assert!(m(r#"<a ID="fred">"#));
assert!(m(r#"<a iD="fred">"#));
Contributor

Maybe add some other tags, so readers can easily see that the code works with any element that has an id?

Contributor

mwcz commented Jan 10, 2022

> That looks ok to me, anyone else having comments?

Looks good to me. It works well in my two trivial test cases and in @mwcz's Giant Site, which had 493 broken anchors in 0.15.2 and with this patch is down to 209 (which seem to be real). And there was no noticeable performance difference at all.

@Keats Keats merged commit 2f699f6 into getzola:next Jan 10, 2022
Keats pushed a commit that referenced this pull request Jan 22, 2022
* Add heuristic checking for HTML anchors

Previously only anchors specified or generated in markdown could be
linked to, without complaint from the link checker. We now use a
simple heuristic check for `name` or `id` attributes.

Duplicate code has been refactored and all XML anchor checks updated
to use regex rather than substring match.

* Fix regexp and refactor
Keats pushed a commit that referenced this pull request Jan 23, 2022
thomasantony pushed a commit to thomasantony/zola that referenced this pull request Sep 17, 2022

4 participants