robots-parse

A lightweight and simple robots.txt parser in node.

Installation

npm install robots-parse

Usage

Use the module to fetch and parse the robots.txt file of a domain, as in the example below:

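const robotsParse = require('robots-parse');

robotsParse('github.com', (err, res) => {
  // Stop early if the robots file could not be retrieved or parsed
  if (err) return console.error(err);
  console.log('Result:', res);
});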

If no callback is passed, the function returns a promise instead:

import robotsParse from 'robots-parse'

(async () => {
  const res = await robotsParse('github.com');
  console.log('Result:', res);
})().catch(console.error)

Alternatively, use the built-in parser to parse robots.txt content you already have, such as a string or the contents of a local file. The parser is synchronous, so you don't need a callback or a promise.

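// Assumes the `request` package from npm is installed; any HTTP client works
const request = require('request');
const {parser} = require('robots-parse');

// Fetch the file yourself, then pass the response body to the parser
request('https://www.google.com/robots.txt', (err, res, body) => {
  if (err) throw err;
  const object = parser(body);
  console.log(object);
});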

Parsing an existing local robots.txt file:

import fs from 'fs'
import {parser} from 'robots-parse'

// Read the file as a UTF-8 string and parse it synchronously
const content = fs.readFileSync('./robots.txt', 'utf-8');
const object = parser(content);

console.log(object);

How it works

By default, the script fetches and parses the robots.txt file of a given website or domain and looks for the following rules:

  • Agents: A user-agent identifies a specific spider. The user-agent field is matched against that specific spider’s (usually longer) user-agent.
  • Host: Supported by Yandex (and not by Google, even though some posts say it is), this directive lets you declare which hostname you want the search engine to show, for example with or without the www prefix.
  • Allow: The allow directive specifies paths that may be accessed by the designated crawlers. When no path is specified, the directive is ignored.
  • Disallow: The disallow directive specifies paths that must not be accessed by the designated crawlers. When no path is specified, the directive is ignored.
  • Sitemap: An absolute URL that points to a Sitemap, Sitemap Index file or equivalent URL.

If the robots file is successfully retrieved and parsed, the result is an object containing the properties listed above. Each agent found carries its own agent-specific allow and disallow rules. The same rules are also collected in the root-level allow and disallow properties, which hold all of them regardless of agent.
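
To make the description above more concrete, here is a minimal standalone sketch of how such a result could be built. It is not the robots-parse implementation: the function name `parseRobots` and the exact property names (`agents`, `allow`, `disallow`, `sitemaps`, `host`) are assumptions chosen to mirror the text, and the real output of the module may differ.

```javascript
// Hypothetical sketch, NOT the robots-parse source code.
// Property names in the result are assumptions based on the README text.
function parseRobots(content) {
  const result = { agents: {}, allow: [], disallow: [], sitemaps: [], host: null };
  let group = [];      // agents the current rules apply to
  let inRules = false; // true once a rule line has followed the user-agent lines

  for (const rawLine of content.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue; // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // A user-agent line after some rules starts a new group
      if (inRules) { group = []; inRules = false; }
      if (!result.agents[value]) result.agents[value] = { allow: [], disallow: [] };
      group.push(value);
    } else if (field === 'allow' || field === 'disallow') {
      inRules = true;
      if (value === '') continue; // no path specified: the directive is ignored
      for (const agent of group) result.agents[agent][field].push(value);
      result[field].push(value); // root-level list, regardless of agent
    } else if (field === 'sitemap') {
      result.sitemaps.push(value);
    } else if (field === 'host') {
      result.host = value;
    }
  }
  return result;
}

const sample = [
  'User-agent: *',
  'Disallow: /private/',
  'Allow: /private/public.html',
  '',
  'User-agent: Googlebot',
  'Disallow:',
  '',
  'Sitemap: https://example.com/sitemap.xml',
].join('\n');

const parsed = parseRobots(sample);
console.log(parsed.agents['*'].disallow); // [ '/private/' ]
console.log(parsed.sitemaps);             // [ 'https://example.com/sitemap.xml' ]
```

In this sketch the empty `Disallow:` under Googlebot is skipped, matching the rule that a directive without a path is ignored, so that agent ends up with no rules of its own.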

You can read more about the robots.txt specification on its Google reference page.


Contributing

  1. Create an issue and describe your idea
  2. Fork the project (https://github.com/b4dnewz/robots-parse/fork)
  3. Create your feature branch (git checkout -b my-new-feature)
  4. Commit your changes (git commit -am 'Add some feature')
  5. Write tests for your code (npm run test)
  6. Publish the branch (git push origin my-new-feature)
  7. Create a new Pull Request

License

MIT © b4dnewz
