Plan Providers
The primary means of configuring JesterJ is by building up a document processing plan. This plan is represented as a very simple Java class designed to have one method, and this method is normally organized into a few distinct parts.
There are 3 basic things to do in a plan provider.
- Configure the steps you will need using fluent/builder APIs
- String your steps together by defining the predecessor(s) of each step using planBuilder.addStep()
- Call planBuilder.build() and return the result
You can see an example of this here:
Notice that the key method is organized as follows:
- Step builder instantiations
- Step builder configuration, including addition of builders for Processors at each step.
- Addition of steps to the plan with specification of predecessor steps
This organization is recommended, but not required. The method may be organized in any fashion that is valid Java syntax.
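The three-part organization described above can be sketched with a toy example. The `ToyStepBuilder`, `ToyPlanBuilder`, and `ToyPlanProvider` classes below are hypothetical stand-ins written only for illustration; they are not JesterJ classes, and the real builder names and method signatures may differ. The step and processor names are likewise made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT a JesterJ class: a step builder with a fluent API.
final class ToyStepBuilder {
    String name;
    final List<String> processors = new ArrayList<>();

    ToyStepBuilder named(String name) { this.name = name; return this; }
    ToyStepBuilder withProcessor(String processor) { processors.add(processor); return this; }
}

// Hypothetical stand-in, NOT a JesterJ class: records each step's predecessors.
final class ToyPlanBuilder {
    private final Map<String, List<String>> predecessors = new LinkedHashMap<>();

    ToyPlanBuilder addStep(ToyStepBuilder step, String... predecessorNames) {
        predecessors.put(step.name, Arrays.asList(predecessorNames));
        return this;
    }

    // The toy "plan" is just the map of step name -> predecessor names.
    Map<String, List<String>> build() { return predecessors; }
}

final class ToyPlanProvider {
    static Map<String, List<String>> getPlan() {
        // 1. Step builder instantiations
        ToyStepBuilder scanner = new ToyStepBuilder();
        ToyStepBuilder format = new ToyStepBuilder();
        ToyStepBuilder sender = new ToyStepBuilder();

        // 2. Step builder configuration, including processors for each step
        scanner.named("file_scanner");
        format.named("format_fields").withProcessor("rename_fields");
        sender.named("solr_sender").withProcessor("send_to_solr");

        // 3. Addition of steps to the plan, naming each step's predecessor(s)
        return new ToyPlanBuilder()
            .addStep(scanner)
            .addStep(format, "file_scanner")
            .addStep(sender, "format_fields")
            .build();
    }
}
```

Keeping the three parts visually separate makes it easier to compare the wiring in part 3 against the structure you intended.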
All builders support a .named(String) method, and a unique textual name should be provided for each. These names are required to match the regular expression ^[A-Za-z][\w.]*$ (the name must start with a letter and thereafter contain only letters, digits, underscores, and dots; note that spaces are NOT allowed).
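To check candidate names before using them, the pattern quoted above can be applied directly with `java.util.regex`. This is a minimal sketch; `StepNameCheck` is a hypothetical helper written for illustration and is not part of JesterJ.

```java
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of JesterJ: tests a candidate step name
// against the regular expression quoted in the text above.
final class StepNameCheck {
    private static final Pattern VALID_NAME = Pattern.compile("^[A-Za-z][\\w.]*$");

    static boolean isValid(String name) {
        // matches() requires the whole string to fit the pattern
        return name != null && VALID_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"solr_sender", "tika.extract", "2nd_step", "my step"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isValid(candidate));
        }
    }
}
```

Here `solr_sender` and `tika.extract` pass, while `2nd_step` (starts with a digit) and `my step` (contains a space) are rejected.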
When planBuilder.build() is called, the structure of the plan is created and checked for cycles (any path that can lead back to the same node). Cycles (loops) are not allowed, and the build should fail if any are found. ANY other legal Directed Acyclic Graph is allowed so long as:
- All tree roots (starting points) begin with implementations of Scanner
- All leaves (final steps) are configured with potent or idempotent processors
Note that it is fully legal and supported to have several sources feeding completely disconnected paths in the same plan.
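To make the cycle rule concrete, here is a minimal sketch of one common way to detect cycles (Kahn's algorithm) over a map of step name to predecessor names. It is not JesterJ's implementation; `PlanGraphCheck` and the step names in the usage note are hypothetical, and the sketch assumes every predecessor is itself a key in the map.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT JesterJ code: steps are plain names mapped to their predecessors.
final class PlanGraphCheck {

    /** Returns true if the graph described by the predecessor map contains no cycles. */
    static boolean isAcyclic(Map<String, List<String>> predecessors) {
        Map<String, Integer> remaining = new HashMap<>();        // unprocessed predecessors per step
        Map<String, List<String>> successors = new HashMap<>();  // reverse edges
        for (Map.Entry<String, List<String>> entry : predecessors.entrySet()) {
            remaining.put(entry.getKey(), entry.getValue().size());
            for (String pred : entry.getValue()) {
                if (!predecessors.containsKey(pred)) {
                    throw new IllegalArgumentException("Unknown predecessor: " + pred);
                }
                successors.computeIfAbsent(pred, k -> new ArrayList<>()).add(entry.getKey());
            }
        }

        // Start from the roots: steps with no predecessors.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }

        // Repeatedly remove a ready step and release the steps that depend on it.
        int processed = 0;
        while (!ready.isEmpty()) {
            String step = ready.remove();
            processed++;
            for (String next : successors.getOrDefault(step, List.of())) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }

        // Any step never released sits on, or downstream of, a cycle.
        return processed == predecessors.size();
    }
}
```

A map with two disconnected paths, such as `scanner_a -> sender_a` and `scanner_b -> sender_b`, passes this check, while a map in which `step_x` and `step_y` each list the other as a predecessor does not.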
Sometimes it is difficult to troubleshoot the construction of your DAG, so JesterJ has a built-in way to visualize the structure. This allows you to verify that the structure you think you specified is actually the structure you did specify.
To generate the visualization, run:
java -jar jesterj-node-1.0-beta2.jar -z viz.png example-shakespeare-1.0-beta2.jar NODENAME NODEPASS
This produces a png visualization of the plan that looks like this:
As of 1.0, plan performance is significantly affected by the number of scanners and the number of steps that must be tracked as output destinations (potent or idempotent steps). For example, trying to increase output by duplicating sender steps may actually decrease performance. The plan in the following diagram, with a batch size of 1000, was about 3x slower than a single solr_sender step with a batch size of 5000.
In theory it should be possible to write a plan in a JVM language other than Java, but you must compile the plan to byte code and assemble it into a .jar file, and the byte code must result in a class that has the @JavaPlanConfig annotation. Assuming that these requirements can be satisfied, it is theoretically possible to write a plan in:
- Jython
- JRuby
- Scala
- Kotlin
- Groovy
If you attempt this, please share the results in the General section of Discussions. At some point in the future we would like to have the ability to dynamically consume and interpret some of these languages, so contributions are welcome there too!