chriso edited this page Apr 10, 2012 · 32 revisions

Data scraping and processing code is organised into modular, extendable jobs written in JavaScript or CoffeeScript. A typical node.io job takes some input, processes or reduces it in some way, and then outputs the emitted results. No step is compulsory; some scraping jobs, for example, don't require input.

Running a job

Jobs can be run from the command line or through a web interface. To run a job from the command line (extension can be omitted), run

$ node.io myjob

Running a job from within another script

Use nodeio.start(job, options, callback, capture_output).

A job usually defines its own output method, but if you need to capture the output and return it to the callback, set capture_output to true. Note that callback takes (err, output) or (err) if not capturing output.
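
As a minimal sketch of capturing output, the helper below wraps nodeio.start with capture_output set to true. The name runCaptured is hypothetical, and the node.io module is passed in as a parameter only so that the helper can be tried against a stub; the start() signature is the one given above.

```javascript
// Hypothetical helper: run `job` and pass its captured output to `done`.
// `nodeio` is expected to be the module returned by require('node.io').
function runCaptured(nodeio, job, done) {
    var options = {};           // no extra job options in this sketch
    var capture_output = true;  // capture the output and return it to the callback
    nodeio.start(job, options, function (err, output) {
        if (err) {
            return done(err);   // (err) only: nothing was captured
        }
        done(null, output);     // (err, output) when capturing
    }, capture_output);
}
```

With node.io installed, this could be called as runCaptured(require('node.io'), require('./double').job, callback), where callback receives (err, output).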

Debugging a job

Sometimes a job behaves incorrectly. To see what is going on under the hood, use the -g or --debug switch:

$ node.io --debug myjob

The anatomy of a job

EDIT: As of npm v1.0.0, node may fail to find node.io. If that happens, run npm install node.io in the same directory as the job you're trying to run.

Example 1: Hello World!

hello.js

var nodeio = require('node.io');
exports.job = new nodeio.Job({
    input: false,
    run: function () {
        this.emit('Hello World!');
    }
});

hello.coffee

nodeio = require 'node.io'
class Hello extends nodeio.JobClass
    input: false
    run: -> @emit 'Hello World!'
    
@class = Hello
@job = new Hello()

To run the example

$ node.io -s hello
     => Hello World!

Note: the -s switch omits status messages from the output. It has the same effect as appending 2> /dev/null

Example 2: Double each element of input

double.js

var nodeio = require('node.io');
exports.job = new nodeio.Job({
    input: [0,1,2],
    run: function (num) {
        this.emit(num * 2);
    }
});

double.coffee

nodeio = require 'node.io'
class Double extends nodeio.JobClass
    input: [0,1,2]
    run: (num) -> @emit num * 2
    
@class = Double
@job = new Double()

Example 3: Inheritance

quad.js

var Double = require('./double').job;

exports.job = Double.extend({
    run: function (num) {
        Double.run.call(this, num * 2);
        //Same as: this.emit(num * 4)
    }
});

quad.coffee

Note: CoffeeScript inheritance across multiple files is temporarily broken in the latest release. A fix is coming soon! Classes that are defined in the same file are fine:

nodeio = require 'node.io'
class Double extends nodeio.JobClass
    input: [0,1,2]
    run: (num) -> @emit num * 2

class Quad extends Double
    run: (num) -> super num * 2
    
@class = Quad
@job = new Quad()

Basic concepts

Job options

Options allow you to easily incorporate common or complex behavior. A full list of options can be found in the API.

Options are specified as an object containing key/value pairs and passed as the first argument to nodeio.Job:

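var nodeio = require('node.io');
var options = {
    timeout: 10,    //Timeout after 10 seconds
    max: 20,        //Run 20 threads concurrently (when run() is async)
    retries: 3      //Threads can retry 3 times before failing
};
//The job's methods, here the same as in double.js
var methods = {
    input: [0,1,2],
    run: function (num) {
        this.emit(num * 2);
    }
};
exports.job = new nodeio.Job(options, methods);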

Determining when a job is complete

Because node.io is asynchronous, it needs to determine when each thread (a call to run()) is complete, and when the entire job is complete.

A thread is complete after:

  • emit(), fail(), retry() or skip() has been called - any subsequent calls in the same thread are ignored
  • An option, such as timeout, causes the thread to automatically call one of the methods above
  • run() returns something other than null - in this case, the return value is emitted

Important: if none of the above conditions is met, the thread will hang indefinitely.
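
The two ways a run() method usually completes a thread can be sketched as follows. This is an illustration only: the names syncMethods and asyncMethods are hypothetical, and the objects are meant to be passed as the methods of a nodeio.Job. It assumes that a run() with no return statement does not complete the thread, which is how the timeout example on this page behaves.

```javascript
// Synchronous: the returned value is emitted, which completes the thread.
var syncMethods = {
    input: [0, 1, 2],
    run: function (num) {
        return num * 2;
    }
};

// Asynchronous: run() returns nothing, so the thread stays open until the
// timer callback calls emit(). Without that call the thread would hang.
var asyncMethods = {
    input: [0, 1, 2],
    run: function (num) {
        var self = this;  // keep a reference to the job for the callback
        setTimeout(function () {
            self.emit(num * 2);
        }, 10);
    }
};
```

Either object could be wrapped with new nodeio.Job(syncMethods) or new nodeio.Job(asyncMethods) and exported as exports.job.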

The job is complete when:

  • All of the input has been consumed, or in the case of input: false, when one thread has completed
  • exit() is called

Passing arguments to jobs

Arguments can be passed to a job on the command line, e.g.

$ node.io myjob arg1 arg2 arg3

Arguments can be accessed through this.options.args, e.g.

run: function() {
    console.log(this.options.args[0]); //"arg1"
}
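
A slightly fuller sketch of reading arguments, with fallbacks for missing values, might look like this. The meaning given to each argument (a repeat count and a word) is made up for illustration, and the arguments are assumed to arrive as strings.

```javascript
// Methods for a job run as, e.g., `node.io myjob 3 hi`
// (the argument meanings are hypothetical).
var methods = {
    input: false,
    run: function () {
        var args = this.options.args || [];
        var count = parseInt(args[0], 10);  // first argument: repeat count
        var word = args[1] || 'Hello';      // second argument: word to repeat
        if (isNaN(count) || count < 1) {
            count = 1;                      // fall back when missing or invalid
        }
        var words = [];
        for (var i = 0; i < count; i++) {
            words.push(word);
        }
        this.emit(words.join(' '));
    }
};
```

Validating arguments inside run() keeps the job usable when it is started with no arguments at all.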

Retrying, skipping or failing a thread

To retry or skip a thread, call retry() or skip() (neither requires arguments). For example, to remove empty lines:

remove_empty_lines.js

var nodeio = require('node.io');
exports.job = new nodeio.Job({
    run: function(line) {
        if (line.trim() === '') {
            this.skip();     //Drop empty lines
        } else {
            this.emit(line); //Pass everything else through
        }
    }
});
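
retry() can be sketched in the same way. In the example below, flakyLookup is a hypothetical stand-in for any asynchronous operation that sometimes fails; it is not part of node.io. The options and methods objects are meant to be passed to new nodeio.Job(options, methods).

```javascript
// Hypothetical lookup that fails on the first attempt for each key and
// succeeds afterwards.
var attempts = {};
function flakyLookup(key, callback) {
    attempts[key] = (attempts[key] || 0) + 1;
    if (attempts[key] < 2) {
        return callback(new Error('temporary failure'));
    }
    callback(null, key.toUpperCase());
}

var options = { retries: 3 };  // threads can retry 3 times before failing
var methods = {
    input: ['a', 'b'],
    run: function (key) {
        var self = this;
        flakyLookup(key, function (err, value) {
            if (err) {
                self.retry();      // ask node.io to retry this thread
            } else {
                self.emit(value);  // success completes the thread
            }
        });
    }
};
```

Each callback path ends in exactly one of retry() or emit(), so the thread always completes.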

Some job options (timeout, retries, redirects) cause fail() to be called automatically once their limit is reached, e.g. when a thread times out:

var nodeio = require('node.io');
exports.job = new nodeio.Job({timeout: 5}, {
    run: function(input) {
        //Nothing here completes the thread, so it will time out after 5 seconds
    },
    fail: function (input, status) { 
        //status = "timeout"
        this.emit('Thread failed'); //You still need to complete the thread with an emit or skip, etc.
    }
});

Go to part 2: Working with input / output

Go to part 3: Scraping data from the web

Go to part 4: Data validation and sanitization