
Add basic site crawler to find 404s, etc. #321

Merged 15 commits into develop on Jun 29, 2017

Conversation

@toolness toolness commented Jun 26, 2017

This is a work-in-progress attempt to fix #269, or at least part of it.

Instructions

  1. Generate the site using `jekyll build`. The `_site` folder should now contain the latest version of the site.

  2. Run `npm run crawl`. It will let you know if it found any errors.

Notes

I decided to use node-simplecrawler because I've had experience using it in the past, and it seems reasonably fast and extensible.

Unlike 18F/content-guide#132, I decided not to go with html-proofer because it's Ruby-based, and I don't have a ton of experience with Ruby. I also heard rumors that we might consider migrating this site from Jekyll to Hugo; if that happens, we would likely remove Ruby entirely from this project, so I figured Node might be a safer long-term option. We can always switch, though!

To do

Some of these can be filed as separate issues and dealt with in separate PRs.

  • Consider making the crawler crawl more than just the index page and all the immediate resources it links to.
  • Fix 404s reported by the tool, or log warnings if the 404s come from external sources. Currently these seem to be:
    • All the JS scripts referenced by ie-polyfill-scripts.html.
    • html5shiv.js, which is in an IE-only comment in head.html.
    • In the developer's guide, there's a broken link to CONTRIBUTING.md.
  • Run the crawler as part of npm test (and therefore during CI).
  • Don't report errors until we're done with the crawl; this way our list of referrers is accurate.
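The last item above could work roughly like the following sketch: accumulate every 404 with its referrer while the crawl runs, and only build the report once the crawl completes, so each broken path lists all of its referrers. (The names `record404` and `buildReport` are illustrative, not from the actual config/crawl.js.)

```javascript
// Sketch of deferred 404 reporting: collect hits during the crawl,
// print nothing until the crawl is done so referrer lists are complete.
// All names here are illustrative, not from the real config/crawl.js.

function record404(store, path, referrer) {
  // Group every referrer under the broken path.
  if (!store.has(path)) store.set(path, new Set());
  store.get(path).add(referrer);
}

function buildReport(store) {
  // Only called after the crawl finishes, so each entry lists
  // every page that referenced the missing resource.
  const lines = [];
  for (const [path, referrers] of store) {
    lines.push(`404 for ${path} (referenced by: ${[...referrers].join(", ")})`);
  }
  return lines;
}

// Example: two pages reference the same missing script.
const notFound = new Map();
record404(notFound, "/js/html5shiv.js", "/index.html");
record404(notFound, "/js/html5shiv.js", "/docs/");
console.log(buildReport(notFound).join("\n"));
// → 404 for /js/html5shiv.js (referenced by: /index.html, /docs/)
```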

@toolness

@donjo any idea what to do about these "less than IE9"-specific JS files that don't seem to exist? Do we even support < IE9 anymore? If not, I guess we can just get rid of the references...

@toolness

Hmm, @donjo just mentioned on slack that our documentation page on accessibility mentions that we only support IE9 and above.

@toolness

So I'm finding that a number of 404s are actually coming from external sources, like WHO_IS_USING_USWDS.md and the release notes, which are stored in the other repository. For these I'm thinking maybe we should log warnings but not errors, so that a broken link caused by one of those files doesn't break the whole build.

@donjo donjo commented Jun 27, 2017

➕ to warnings for external source 404s. Agree.
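The agreed-upon behavior might look roughly like this sketch (the `WARNING_PAGES` entries and function name are hypothetical; the real logic lives in config/crawl.js): a 404 is downgraded to a warning only when every page that references it comes from an external source.

```javascript
// Sketch: a 404 becomes a warning when *every* referrer is a page
// rendered from upstream content we don't control. The page paths
// below are illustrative, not the actual list used by the crawler.

const WARNING_PAGES = [
  "/whats-new/",       // e.g. rendered from WHO_IS_USING_USWDS.md upstream
  "/release-notes/",   // e.g. rendered from upstream release notes
];

function classify404(referrers) {
  // Error only if at least one referrer is a page we maintain ourselves;
  // a purely external broken link shouldn't fail the whole build.
  const isWarning = referrers.every(path => WARNING_PAGES.includes(path));
  return isWarning ? "WARNING" : "ERROR";
}

console.log(classify404(["/whats-new/"]));                  // WARNING
console.log(classify404(["/whats-new/", "/components/"]));  // ERROR
```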

@toolness toolness changed the title [WIP] Add basic site crawler to find 404s, etc. Add basic site crawler to find 404s, etc. Jun 29, 2017
@jseppi jseppi left a comment

Looks good to me, and runs well!

config/crawl.js Outdated
```js
const isWarning = refs.every(path => WARNING_PAGES.includes(path));
const label = isWarning ? WARNING : ERROR;

console.log(`${label}: 404 for ${item.path}!`);
```
Leaving off the `!` would clean up the output a bit and make it easier to copy the path.

config/crawl.js Outdated

```js
const app = express();

app.use(express.static(`${__dirname}/../_site`));
```

Maybe check that `_site` has contents; otherwise the call to `refs.every(...)` later on (and potentially other things) will fail.

Here's the output from running `yarn crawl` without first running `yarn build`:

```
/Users/jamesseppi/CODE/web-design-standards-docs/config/crawl.js:67
        const isWarning = refs.every(path => WARNING_PAGES.includes(path));
                              ^

TypeError: Cannot read property 'every' of undefined
    at notFound.forEach.item (/Users/jamesseppi/CODE/web-design-standards-docs/config/crawl.js:67:31)
```

@toolness

Good suggestions @jseppi! Just incorporated them.

@donjo I think this is good to merge if you are OK with the functionality!

@toolness toolness requested a review from donjo June 29, 2017 15:26
@toolness toolness mentioned this pull request Jun 29, 2017
@donjo donjo left a comment


lgtm 🚢

@toolness
Copy link
Contributor Author

woooooooot!

@toolness toolness merged commit f0a0515 into develop Jun 29, 2017