Document Processors
Document processors are the heart of what JesterJ is about. Everything else in the system is meant to procure data and feed it to the document processors in an orderly fashion. JesterJ aims to provide useful document processors that, when used in combination, can serve most common use cases. When unique use cases arise, it should also be easy to create custom processors.
Creating your own processor should be easy. If it's not, please file an issue here in GitHub! All you need to do is implement `org.jesterj.ingest.model.DocumentProcessor` and pass an instance of your processor to `StepImpl.Builder.withProcessor()`. That is the only hard requirement right now, but here are some suggestions for smoother operation and adaptability to future releases:
- If a document errors out, it should do `doc.setStatus(Status.ERROR)` and return it from the method. Logging the status change and not sending it to the next step is handled by the framework.
- If a document should be dropped, again `doc.setStatus(Status.DROP)` and return it from `processDocument()`.
- Your processor should either be stateless or immutable if possible. There will be lots of threads out there once we implement thread pool scaling, and it's far easier to be immutable than thread safe.
- For future compatibility you will be best served by creating a builder pattern similar to the existing DocumentProcessor implementations, StepImpl, PlanImpl, etc. In future releases we may serialize the builders and send them to other nodes so that those nodes can do work too.
A template for creating a document processor implementation is available here.
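For orientation, here is a minimal sketch of what a custom processor might look like. It assumes the interface exposes `getName()` and a `processDocument(Document)` method returning an array of documents, and that `Document` offers multimap-style field access; check the `DocumentProcessor` interface and the template mentioned above for the exact signatures in your JesterJ version. The class name and field handling below are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;

import org.jesterj.ingest.model.Document;
import org.jesterj.ingest.model.DocumentProcessor;
import org.jesterj.ingest.model.Status;

// Hypothetical example: upper-cases every value of a configured field.
public class UpperCaseFieldProcessor implements DocumentProcessor {

  private final String name;   // immutable configuration is friendly to future thread pool scaling
  private final String field;

  public UpperCaseFieldProcessor(String name, String field) {
    this.name = name;
    this.field = field;
  }

  @Override
  public String getName() {
    return name;
  }

  @Override
  public Document[] processDocument(Document document) {
    try {
      List<String> values = new ArrayList<>(document.get(field)); // assumes multimap-style field access
      document.removeAll(field);
      for (String value : values) {
        document.put(field, value.toUpperCase());
      }
    } catch (Exception e) {
      // Mark the document and return it; the framework logs the status change
      // and keeps the document from reaching the next step.
      document.setStatus(Status.ERROR);
    }
    return new Document[]{document};
  }
}
```

An instance of such a class would then be passed to `StepImpl.Builder.withProcessor()` as described above, though providing a builder for the processor itself is the more future-proof pattern.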
These processors massage data within a field, or move data among fields.
Copies data from an existing field to another field, creating the destination if necessary, or adding to or overwriting it depending on configuration.
Removes a field and all its values from the document.
Interprets the value of a field as an Apache Velocity template, using the document as context. If the field has multiple values, all values will be interpreted and replaced. It's also important to remember that the fields being referenced can contain multiple values, so one usually wants to write `$foobar[0]`, not `$foobar`. The latter will lead to replacement with the string value of the list containing the values, in other words `[foo]` if only one value is held or `[foo, bar, baz]` if three values are presently held in the field.
WARNING: this uses the Velocity templating engine, which is powerful but potentially dangerous!! You want to ensure that the template is NOT derived from and does NOT CONTAIN any text that is provided by users or other untrustworthy sources before it is interpreted by this processor. If you allow user data to be interpreted as a template, you have given the user the ability to run ARBITRARY code on the ingestion infrastructure. Recommended usages include specifying the template field as a statically defined field, or drawing it from a known, controlled and curated database containing templates. Users are also strongly cautioned against chaining multiple instances of this step, since it becomes exponentially more difficult to ensure user-controlled data is not added to the template and then subsequently interpreted. With great power comes great responsibility. Don't run with scissors... you have been warned!
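To make the multi-value caution concrete, here is a small, hypothetical stand-alone sketch (not JesterJ code) showing how Velocity renders `$foobar[0]` versus `$foobar` when the context value is a list:

```java
import java.io.StringWriter;
import java.util.Arrays;

import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

public class VelocityListExample {
  public static void main(String[] args) {
    VelocityEngine engine = new VelocityEngine();
    engine.init();

    // Simulates a document field holding three values.
    VelocityContext context = new VelocityContext();
    context.put("foobar", Arrays.asList("foo", "bar", "baz"));

    StringWriter out = new StringWriter();
    engine.evaluate(context, out, "example", "$foobar[0] vs $foobar");
    System.out.println(out); // prints: foo vs [foo, bar, baz]
  }
}
```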
Reads a field value and replaces all matches of the supplied regular expression with the configured replacement text.
Sets readable file size field values, as follows:
- Reads the value of the specified input field, interprets it as a number, determines its magnitude and expresses it as bytes, KB, MB, GB or TB;
- Provides options to write a combined field ("200 KB"), a units field ("KB"), and/or a numeric field ("200"). If the size is over 1 GB, the size is returned as the number of whole GB, i.e. the size is rounded down to the nearest GB boundary. Similarly for the 1 MB and 1 KB boundaries.
Takes an input field name and an output field name, parses the input with a date format string (as per Java's `DateTimeFormatter` class), and then formats the result using an output format (`DateTimeFormatter.ISO_INSTANT` by default).
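As a rough illustration of the underlying Java behaviour (the input pattern and value below are made up; the real pattern is whatever you configure on the processor):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class DateReformatExample {
  public static void main(String[] args) {
    // Hypothetical input pattern and value.
    DateTimeFormatter input = DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss").withZone(ZoneOffset.UTC);
    Instant parsed = input.parse("07/04/2023 12:30:00", Instant::from);

    // The default output format is DateTimeFormatter.ISO_INSTANT.
    System.out.println(DateTimeFormatter.ISO_INSTANT.format(parsed)); // 2023-07-04T12:30:00Z
  }
}
```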
Converts the current value or values to multiple values by splitting them on a delimiter
Applies URLEncoder.encode(value, enc) to the values of the field and replaces the values with the encoded result. The encoding may be specified when this processor is configured.
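For example, the standard library call this processor wraps behaves roughly like this (the input string is invented, and "UTF-8" stands in for whatever encoding the processor is configured with):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UrlEncodeExample {
  public static void main(String[] args) throws UnsupportedEncodingException {
    // Spaces become '+' and reserved characters are percent-encoded.
    String encoded = URLEncoder.encode("search terms & more", "UTF-8");
    System.out.println(encoded); // search+terms+%26+more
  }
}
```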
These processors acquire additional data to be added to the document.
Interprets a specified field as a URL and issues a GET request to fetch the document at that URL. The result is set as a string value in a field of the document. It can be configured to throttle outgoing requests to avoid causing a denial of service on the destination site. It uses a simple URL connection and has no facility for adding headers, etc. For more complicated scenarios such as authentication, a custom processor implemented with Apache HTTP Client is recommended.
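If you do need headers, authentication, retries, and so on, the fetch portion of such a custom processor might look roughly like this sketch using Apache HttpClient 4.x (the URL, header, and field handling are placeholders, not part of JesterJ):

```java
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class FetchWithHeadersExample {
  public static void main(String[] args) throws IOException {
    try (CloseableHttpClient client = HttpClients.createDefault()) {
      // In a real processor the URL would come from the configured source field.
      HttpGet get = new HttpGet("https://example.com/resource");
      get.addHeader("Authorization", "Bearer <token>"); // the kind of header FetchUrl cannot set

      try (CloseableHttpResponse response = client.execute(get)) {
        String body = EntityUtils.toString(response.getEntity());
        // In a real processor, 'body' would be written to the configured destination field.
        System.out.println(body.length() + " characters fetched");
      }
    }
  }
}
```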
These processors analyze the raw data read by a scanner and use it to populate many document fields. They may also manipulate the cached copy of the data originally read by the scanner.
A memory-efficient processor for extracting information from XML documents. The raw data for the document is parsed with a StaxProcessor. Data is extracted from the XML and mapped to fields in the document. Simple cases can be specified in a fashion similar to XPath but with much less complicated syntax. More complicated cases involving attributes can be achieved with an `ElementSpec` instance, and full control below the match point can be exercised by supplying a custom `StaxExtractingProcessor.LimitedStaxHandlerFactory()` implementation to the `ElementSpec`. See the unit test for examples.
Extracts text and metadata from the document's data via Apache Tika. Can be provided with a full Tika configuration using Tika's XML configuration format. Extracted metadata are added to the document as fields with a configurable suffix, and the extracted text replaces the raw data in the Document object.
These processors relate to the flow of data through the ingestion DAG.
This processor is for any case where you want to explicitly drop a document based on the results of the logic in a router. The most likely use case for this is as one of the destinations for a RouteByStepName router if the value of a field can be used to determine that the document is not useful for your search index.
This processor is similar to LogAndDrop except the result will be an error. JesterJ will attempt to reprocess the document and so usually this is only useful for testing error scenarios.
These processors prepare or send data to Apache Solr
This processor allows you to move the vast majority of the work that Solr does when indexing OUT of Solr. If you have a very heavy analysis phase, using this processor could keep your index responsive to queries even under heavy indexing load, because the heavy load will slow down JesterJ, not Solr!
The configuration for this processor allows you to provide your schema.xml (or current managed_schema) to JesterJ. JesterJ then uses Solr's classes to read the schema and pre-populate the fields with Solr's preanalyzed JSON data. See https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type for further information.
Note that if you are using a different version of Solr than JesterJ, you may want to copy the source for this class into your project, and package your version of Solr into an UnoJar to ensure no compatibility issues.
This is how you get documents into Solr! This is typically the final step in the primary path through your ingestion plan. This processor provides batching to ensure Solr is used efficiently, and it has an automatic fallback when a batch fails that re-sends the documents individually, so that as many documents as possible make it into your index and it's easy to determine which documents actually have problems.
These classes may be useful in writing your own processors
Extend this class if you want to write your own processor that sends documents somewhere else (like Elastic, Algolia, Coveo, or any other system that benefits from receiving data in batches).