
Add basic site crawler to find 404s, etc. #321

Merged 15 commits into develop on Jun 29, 2017

Conversation

@toolness toolness commented Jun 26, 2017

This is a work-in-progress attempt to fix #269, or at least part of it.

Instructions

  1. Generate the site using `jekyll build`. The `_site` folder should now contain the latest version of the site.

  2. Run `npm run crawl`. It will let you know if it found any errors.

Notes

I decided to use node-simplecrawler because I've had experience using it in the past, and it seems reasonably fast and extensible.

Unlike 18F/content-guide#132, I decided not to go with html-proofer because it's Ruby-based, and I don't have a ton of experience with Ruby. I also heard rumors that we might consider migrating this site from Jekyll to Hugo; if that happens, we would likely remove Ruby entirely from this project, so I figured Node might be a safer long-term option. We can always switch, though!

To do

Some of these can be filed as separate issues and dealt with in separate PRs.

  • Consider making the crawler crawl more than just the index page and all the immediate resources it links to.
  • Fix 404s reported by the tool, or log warnings if the 404s come from external sources. Currently these seem to be:
    • All the JS scripts referenced by ie-polyfill-scripts.html.
    • html5shiv.js, which is in an IE-only comment in head.html.
    • In the developer's guide, there's a broken link to CONTRIBUTING.md.
  • Run the crawler as part of npm test (and therefore during CI).
  • Don't report errors until we're done with the crawl; this way our list of referrers is accurate.
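The last item above could work roughly like the following sketch: accumulate every 404 with its referrer while the crawl runs, and only build the report once the crawl completes, so each broken path lists all of its referrers. (The names `record404` and `buildReport` are illustrative, not from the actual config/crawl.js.)

```javascript
// Sketch of deferred 404 reporting: collect hits during the crawl,
// print nothing until the crawl is done so referrer lists are complete.
// All names here are illustrative, not from the real config/crawl.js.

function record404(store, path, referrer) {
  // Group every referrer under the broken path.
  if (!store.has(path)) store.set(path, new Set());
  store.get(path).add(referrer);
}

function buildReport(store) {
  // Only called after the crawl finishes, so each entry lists
  // every page that referenced the missing resource.
  const lines = [];
  for (const [path, referrers] of store) {
    lines.push(`404 for ${path} (referenced by: ${[...referrers].join(", ")})`);
  }
  return lines;
}

// Example: two pages reference the same missing script.
const notFound = new Map();
record404(notFound, "/js/html5shiv.js", "/index.html");
record404(notFound, "/js/html5shiv.js", "/docs/");
console.log(buildReport(notFound).join("\n"));
// → 404 for /js/html5shiv.js (referenced by: /index.html, /docs/)
```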

@toolness

@donjo any idea what to do about these "less than IE9"-specific JS files that don't seem to exist? Do we even support < IE9 anymore? If not, I guess we can just get rid of the references...

@toolness

Hmm, @donjo just mentioned on slack that our documentation page on accessibility mentions that we only support IE9 and above.

@toolness

So I'm finding that a number of 404s are actually coming from external sources, like WHO_IS_USING_USWDS.md and the release notes, which are stored in the other repository. For these I'm thinking maybe we should log warnings but not errors, so that a broken link caused by one of those files doesn't break the whole build.

@donjo donjo commented Jun 27, 2017

➕ to warnings for external source 404s. Agree.
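The agreed-upon behavior might look roughly like this sketch (the `WARNING_PAGES` entries and function name are hypothetical; the real logic lives in config/crawl.js): a 404 is downgraded to a warning only when every page that references it comes from an external source.

```javascript
// Sketch: a 404 becomes a warning when *every* referrer is a page
// rendered from upstream content we don't control. The page paths
// below are illustrative, not the actual list used by the crawler.

const WARNING_PAGES = [
  "/whats-new/",       // e.g. rendered from WHO_IS_USING_USWDS.md upstream
  "/release-notes/",   // e.g. rendered from upstream release notes
];

function classify404(referrers) {
  // Error only if at least one referrer is a page we maintain ourselves;
  // a purely external broken link shouldn't fail the whole build.
  const isWarning = referrers.every(path => WARNING_PAGES.includes(path));
  return isWarning ? "WARNING" : "ERROR";
}

console.log(classify404(["/whats-new/"]));                  // WARNING
console.log(classify404(["/whats-new/", "/components/"]));  // ERROR
```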

@toolness toolness changed the title [WIP] Add basic site crawler to find 404s, etc. Add basic site crawler to find 404s, etc. Jun 29, 2017
@jseppi jseppi left a comment

Looks good to me, and runs well!

config/crawl.js Outdated
```js
const isWarning = refs.every(path => WARNING_PAGES.includes(path));
const label = isWarning ? WARNING : ERROR;

console.log(`${label}: 404 for ${item.path}!`);
```
Leaving off the `!` would clean up the output a bit and make it easier to copy the path.

config/crawl.js Outdated

```js
const app = express();

app.use(express.static(`${__dirname}/../_site`));
```

Maybe check that `_site` has contents; otherwise the call to `refs.every(...)` later on (and potentially other things) will fail.

Here's the output from running `yarn crawl` without first running `yarn build`:

```
/Users/jamesseppi/CODE/web-design-standards-docs/config/crawl.js:67
        const isWarning = refs.every(path => WARNING_PAGES.includes(path));
                              ^

TypeError: Cannot read property 'every' of undefined
    at notFound.forEach.item (/Users/jamesseppi/CODE/web-design-standards-docs/config/crawl.js:67:31)
```

@toolness

Good suggestions @jseppi! Just incorporated them.

@donjo I think this is good to merge if you are OK with the functionality!

@toolness toolness requested a review from donjo June 29, 2017 15:26
@toolness toolness mentioned this pull request Jun 29, 2017
@donjo donjo left a comment


lgtm 🚢

@toolness
Copy link
Contributor Author

woooooooot!

@toolness toolness merged commit f0a0515 into develop Jun 29, 2017