robots-parse

A lightweight, simple robots.txt parser for Node.js.

Installation

npm install robots-parse

Usage

You can use the module to fetch and parse the robots.txt file of a domain, as in the example below:

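const robotsParse = require('robots-parse');

robotsParse('github.com', (err, res) => {
  // Handle retrieval or parsing failures before using the result.
  if (err) {
    return console.error(err);
  }
  console.log('Result:', res);
});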

If no callback is specified, the function returns a promise:

import robotsParse from 'robots-parse'

(async () => {
  const res = await robotsParse('github.com');
  console.log('Result:', res);
})().catch(console.error)

You can also use the built-in parser to parse robots.txt content you already have, for example a local file or a string. The parser is synchronous, so no callback or promise is needed.

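// Assumes the `request` package is installed for the HTTP call.
const request = require('request');
const {parser} = require('robots-parse');

request('https://www.google.com/robots.txt', (err, res, body) => {
  if (err) {
    return console.error(err);
  }
  // Parse the response body synchronously.
  const object = parser(body);
  console.log(object);
});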

Parsing an existing local robots.txt file:

import fs from 'fs';
import {parser} from 'robots-parse';

// Read the file as a string, then parse it synchronously.
const content = fs.readFileSync('./robots.txt', 'utf-8');
const object = parser(content);

console.log(object);

How it works

By default, the module fetches and parses the robots.txt file of the given website or domain and looks for the following directives:

Agents: A user-agent identifies a specific spider. The user-agent field is matched against that specific spider’s (usually longer) user-agent.
Host: Supported by Yandex (and not by Google, even though some posts say it is), this directive lets you state the preferred host name, for example with or without www, that the search engine should show.
Allow: The allow directive specifies paths that may be accessed by the designated crawlers. When no path is specified, the directive is ignored.
Disallow: The disallow directive specifies paths that must not be accessed by the designated crawlers. When no path is specified, the directive is ignored.
Sitemap: An absolute URL that points to a Sitemap, Sitemap Index file or equivalent URL.
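
To illustrate how a short user-agent token from robots.txt relates to a crawler's longer user-agent string, here is a minimal sketch. It assumes case-insensitive substring matching and treats `*` as a wildcard for every crawler. The helper `agentMatches` is hypothetical and is not part of robots-parse, which only parses the file.

```javascript
// Hypothetical helper, NOT part of robots-parse.
// Assumption: a robots.txt agent token matches when it appears,
// case-insensitively, inside the crawler's full user-agent string.
function agentMatches(robotsAgent, crawlerUserAgent) {
  // '*' addresses every crawler.
  if (robotsAgent === '*') return true;
  return crawlerUserAgent.toLowerCase().includes(robotsAgent.toLowerCase());
}

const ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
console.log(agentMatches('Googlebot', ua)); // true
console.log(agentMatches('bingbot', ua));   // false
```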

If the robots file is successfully retrieved and parsed, the function returns an object containing the properties listed above. Each agent found holds its own agent-specific allow and disallow rules. The same rules are also collected, without distinction by agent, in the root allow and disallow properties.
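
The result shape described above can be sketched roughly as follows. This is a minimal illustration, not the robots-parse implementation: the function name `sketchParse` and the property names `agents`, `allow`, `disallow`, `sitemaps` and `host` are assumptions chosen for the example, and the real output may differ.

```javascript
// Hypothetical sketch, NOT the robots-parse source code.
// Property names on the result object are assumptions.
function sketchParse(content) {
  const result = {
    agents: Object.create(null), // per-agent allow/disallow rules
    allow: [],                   // every allow rule, regardless of agent
    disallow: [],                // every disallow rule, regardless of agent
    sitemaps: [],
    host: null,
  };
  let current = [];         // agents of the group currently being read
  let lastWasAgent = false; // consecutive User-agent lines share a group

  for (const rawLine of content.split(/\r?\n/)) {
    // Drop comments and surrounding whitespace.
    const line = rawLine.replace(/#.*$/, '').trim();
    const sep = line.indexOf(':');
    if (sep === -1) continue;

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      if (!lastWasAgent) current = [];
      current.push(value);
      if (!result.agents[value]) {
        result.agents[value] = { allow: [], disallow: [] };
      }
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;

    if (field === 'allow' || field === 'disallow') {
      if (value === '') continue; // no path given: the directive is ignored
      for (const agent of current) result.agents[agent][field].push(value);
      result[field].push(value);
    } else if (field === 'sitemap') {
      result.sitemaps.push(value);
    } else if (field === 'host') {
      result.host = value;
    }
  }
  return result;
}

const sample = [
  'User-agent: *',
  'Disallow: /private/',
  'Allow: /private/public.html',
  '',
  'User-agent: Googlebot',
  'Disallow:',
  'Sitemap: https://example.com/sitemap.xml',
].join('\n');

const parsed = sketchParse(sample);
console.log(parsed.agents['*'].disallow); // [ '/private/' ]
console.log(parsed.sitemaps);             // [ 'https://example.com/sitemap.xml' ]
```

In this sketch the empty `Disallow:` line for Googlebot adds no rule, mirroring the note that a directive without a path is ignored.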

You can read more about the robots file specification on its Google reference page.

Contributing

Create an issue and describe your idea
Fork the project (https://github.com/b4dnewz/robots-parse/fork)
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Write tests for your code (npm run test)
Publish the branch (git push origin my-new-feature)
Create a new Pull Request

License

MIT © b4dnewz
