dtrees/README

dClass pattern format and file structure


 COMMENTS

Comments in the .dtree file are prefixed with '#'. The comment prefix must be
the first character in the line. The first non-directive comment in the .dtree
file is used as the index comment and will be stored within the index. Example:

#This is a comment


 COMMENT DIRECTIVES

The dClass index supports several pattern search modes. Each mode is enabled by
a comment directive that begins with the characters '#$'. dClass supports the
following directives:

#$regex
This allows for regular expression pattern matching.

#$dups
This allows for a single pattern to have multiple pattern definitions.

#$partial
This allows for partial pattern matches in non regex searches.

#$notrim
This does not trim extra whitespace around pattern definitions.

#@error_id
This sets the id for the key values returned on error. Note that it uses the
'#@' prefix rather than '#$'.

The dClass index supports string caching. Cached strings are only allocated
once and re-referenced thereafter. Strings are added to the cache with the '#!'
directive, as in the example below. The order of strings in this directive (and
the order of pattern definitions) is used to break ties in patterns with the
same rank. The size of the cache is set using DTREE_M_LOOKUP_CACHE. Also, the
first string allocated will be returned as the error id with no key values if
the error id directive is not used.

#!unknown,string1,groupa,string ab3
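
The prefixes described above can be told apart by their first two characters.
The following Python sketch is not part of dClass. The function name
classify_line and the returned labels are hypothetical, and trimming the cached
strings is an assumption.

```python
def classify_line(line):
    """Return (kind, payload) for one line of a .dtree file (illustrative only)."""
    if not line.startswith("#"):
        # '#' must be the first character, so anything else is not a comment.
        return ("pattern", line) if line.strip() else ("blank", "")
    if line.startswith("#$"):
        # Search mode directive, for example regex, dups, partial or notrim.
        return ("mode", line[2:].strip())
    if line.startswith("#@"):
        # Error id directive.
        return ("error_id", line[2:].strip())
    if line.startswith("#!"):
        # String cache: an ordered, comma separated list.
        # Assumption: each string is trimmed like a pattern attribute.
        return ("strings", [s.strip() for s in line[2:].split(",")])
    return ("comment", line[1:].strip())
```

The order of the returned string list is kept because the text above uses it to
break ties between patterns of equal rank.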


 PATTERNS

All patterns are defined in the following format:

pattern ; id ; type ; parent ; key1=value , key2=value2

Note that semicolons are used to separate the pattern attributes and commas are
used to separate the key value pairs. By default whitespace is trimmed. There
are two ways to include leading and trailing whitespace. First, you can wrap
your attribute in double quotes. Second, you can set the notrim directive.
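
As a rough illustration of the format above, the following Python sketch splits
one pattern line into its five attributes. It is not the dClass parser. The
names clean and parse_pattern_line are hypothetical, and the sketch assumes
that separators never appear inside an attribute.

```python
def clean(attr, notrim=False):
    """Trim an attribute unless notrim is set, then drop wrapping double quotes."""
    if not notrim:
        attr = attr.strip()
    if len(attr) >= 2 and attr.startswith('"') and attr.endswith('"'):
        attr = attr[1:-1]  # quoted attributes keep their inner whitespace
    return attr


def parse_pattern_line(line, notrim=False):
    """Split 'pattern ; id ; type ; parent ; k=v , k=v' into a dict."""
    fields = line.split(";")
    fields += [""] * (5 - len(fields))  # pad missing trailing attributes
    pattern, pid, ptype, parent, kvs = fields[:5]
    pairs = {}
    for pair in kvs.split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)
            pairs[clean(key, notrim)] = clean(value, notrim)
    return {
        "pattern": clean(pattern, notrim),
        "id": clean(pid, notrim),
        "type": clean(ptype, notrim),
        "parent": clean(parent, notrim),
        "kv": pairs,
    }
```

For example, parse_pattern_line('" dog " ; d ; W ; ;') keeps the spaces around
dog because the attribute is quoted.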


pattern

This is the pattern to match. Patterns must begin and end on a token (non
separator) character boundary (DTREE_HASH_TCHARS or less than DTREE_HASH_SEP).
Patterns may contain separator characters. If the regex directive is used, the
following regex is supported:

 .  - any character
 ?  - the preceding character or group is optional
 ^  - the beginning of the text to classify
 $  - the end of the text to classify
 \  - escape, treat the next character as non regex
 [] - a set of characters, only 1 of the set needs to match
 () - a group of characters, the whole group needs to match

The longest pattern is always matched. Token characters are always matched
before regex characters.
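
The operators listed above are also understood by Python's re module, so it can
serve as a rough stand-in for previewing what a pattern accepts. This is only
an approximation: re knows nothing about dClass token boundaries or
longest-match ordering, and the helper name matches is hypothetical.

```python
import re


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


# '?' makes the preceding character optional.
print(matches(r"hot ?dog", "a hotdog"), matches(r"hot ?dog", "a hot dog"))
# '[]' needs only one character of the set to match.
print(matches(r"gr[ae]y", "grey"))
# '()' groups characters, here the whole group is optional.
print(matches(r"(hot )?dog", "dog"))
# '\' escapes the next character, so the dot is literal.
print(matches(r"3\.5", "315"))
```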


id

Unique identifier for the pattern. Patterns with matching ids will share key
values (the first defined key value set will be used). If referenced as a
parent (or defined as a BASE), only one of the patterns in the id set needs
to match.


type

The following types are supported: S,B,C,W,N.

 S - STRONG pattern. If found, classification is stopped and the pattern is
     returned immediately regardless of other patterns found or to be found.
 B - BASE pattern. This pattern is only used as a parent dependency. May itself
     have a parent dependency.
 C - CHAIN pattern. If all parent dependent BASE patterns have been found then
     this pattern is complete. If no rank has been assigned, classification is
     stopped and the pattern is returned immediately. Otherwise it is returned
     at the end of classification if it is the highest ranking CHAIN pattern found
     in the absence of a higher ranking WEAK pattern.
 W - WEAK pattern. If found, this pattern is returned at the end of
     classification if it is the highest ranking WEAK pattern found in the
     absence of a higher ranking CHAIN pattern.
 N - NONE pattern. If found, this pattern is returned at the end of
     classification in the absence of CHAIN or WEAK patterns.

Some types can take an optional rank parameter. All types can take an optional
position parameter. Example:

type[RANK]:[POS]

CHAIN and WEAK patterns can take an optional rank parameter. Ranks can range
from 1 to 255. The highest ranking CHAIN or WEAK pattern is returned. If a
CHAIN pattern has no rank, it is returned immediately without analyzing the
rest of the string. If both a CHAIN and WEAK pattern have the same rank, the
CHAIN pattern is returned.
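
The selection rules for the five types can be summarized in a short Python
sketch. It is not the dClass implementation. The function pick_result and its
input shape are hypothetical, each CHAIN entry is assumed to already have its
parents satisfied, an unranked WEAK pattern is assumed to rank lowest, and ties
within one type simply keep the earliest match.

```python
def pick_result(matches, error_id="unknown"):
    """matches: list of (id, type, rank) in the order found; rank None = unranked."""
    best = None      # (sort key, id) of the best ranked CHAIN or WEAK so far
    fallback = None  # first NONE pattern seen
    for pid, ptype, rank in matches:
        if ptype == "S" or (ptype == "C" and rank is None):
            return pid  # STRONG and unranked CHAIN stop classification
        if ptype in ("C", "W"):
            # Compare by rank first; on equal rank CHAIN beats WEAK.
            key = (rank or 0, ptype == "C")
            if best is None or key > best[0]:
                best = (key, pid)
        elif ptype == "N" and fallback is None:
            fallback = pid
        # BASE patterns are never returned on their own.
    if best is not None:
        return best[1]
    return fallback if fallback is not None else error_id
```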

All patterns can take an optional position parameter. Positions can range from
1 to 254. A position of 1 is equivalent to the '^' regex character. The pattern
must start at the given word position in the classified text (using the defined 
separator chars) to be valid.

Example:

B:2   - the BASE pattern must start at the second word
W50:5 - the WEAK pattern must start at the fifth word and has a rank of 50
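
A minimal Python sketch of reading the type[RANK]:[POS] field, using the ranges
given above. The names TYPE_FIELD and parse_type are hypothetical, and
rejecting a rank on S, B and N patterns is an assumption based on the statement
that only CHAIN and WEAK patterns take a rank.

```python
import re

# One type letter, an optional rank, then an optional ':position'.
TYPE_FIELD = re.compile(r"([SBCWN])(\d+)?(?::(\d+))?")


def parse_type(field):
    """Return (type, rank, position); rank and position are None when absent."""
    m = TYPE_FIELD.fullmatch(field.strip())
    if m is None:
        raise ValueError("malformed type field: %r" % field)
    ptype = m.group(1)
    rank = int(m.group(2)) if m.group(2) else None
    pos = int(m.group(3)) if m.group(3) else None
    if rank is not None:
        if ptype not in ("C", "W"):
            raise ValueError("only CHAIN and WEAK patterns take a rank")
        if not 1 <= rank <= 255:
            raise ValueError("rank must be between 1 and 255")
    if pos is not None and not 1 <= pos <= 254:
        raise ValueError("position must be between 1 and 254")
    return ptype, rank, pos
```

With this sketch, parse_type("W50:5") yields ("W", 50, 5) and parse_type("B:2")
yields ("B", None, 2).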


parent

id of the BASE pattern on which this pattern depends. Only BASE or CHAIN
patterns can have parents. The parent parameter can take an optional direction
parameter which is the maximum word distance the parent can be from the given
pattern. Example:

parent:[DIR]

Directions can range from -127 to 127. If a parent has no direction or a
direction of 0, the parent can be any distance before the given pattern.
Negative directions (-127 to -1) precede the given pattern and positive
directions (1 to 127) follow the given pattern. Parents with a positive
direction cannot themselves have any parent dependencies.
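
The direction rule can be sketched in Python as follows. This is an
interpretation of the text above, not dClass code. The names parse_parent and
parent_in_range are hypothetical, and word positions are assumed to be 1-based
indexes into the classified text.

```python
def parse_parent(field):
    """Split 'parent:[DIR]' into (parent id, direction); direction defaults to 0."""
    name, sep, direction = field.strip().partition(":")
    value = int(direction) if sep else 0
    if not -127 <= value <= 127:
        raise ValueError("direction must be between -127 and 127")
    return name.strip(), value


def parent_in_range(parent_word, pattern_word, direction):
    """Check whether the parent position satisfies the direction."""
    offset = parent_word - pattern_word  # negative: parent comes first
    if direction == 0:
        return offset < 0                # any distance before the pattern
    if direction < 0:
        return direction <= offset < 0   # at most |direction| words before
    return 0 < offset <= direction       # at most direction words after


words = "A bird saw a hot nice dog".split()
hot, dog = words.index("hot") + 1, words.index("dog") + 1
print(parent_in_range(hot, dog, -2))     # hot is 2 words before dog
```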


key values

Key value pairs which are returned with the pattern when matched. A uniform set
of keys gives the best runtime performance since the key set is reused across
all patterns (an empty keyset does not violate this).


 EXAMPLES

The following is an example .dtree with comments:

#hotdog example
#enable regex, dups, and set unknown as the error id
#$regex
#$dups
#!unknown

#pattern  ;id      ;type  ;parent  ;key=values
cat       ;cat     ;S     ;        ;return=cat
dog       ;dog     ;W10   ;        ;return=dog
hot ?dog  ;hotdog  ;W25   ;        ;return=hotdog
hot       ;hot     ;B     ;        ;
dog       ;hotdogc ;C40   ;hot:-2  ;return=special hot dog
bird      ;bird    ;N     ;        ;return=bird

#end example

The first cat pattern is STRONG, meaning that when the pattern is found, it is
returned immediately. The next dog pattern is a WEAK pattern of rank 10. The
hotdog pattern after that has an optional space and it has a WEAK rank of 25.
Next is a BASE pattern with the word hot. This pattern is referenced by a
duplicate dog pattern, and the parent (hot) must be found within 2 words before
this pattern. This pattern has the highest rank of 40, but still ranks below
the STRONG pattern. Finally, there is a bird pattern which has a NONE type,
which means it is only returned when found and no other pattern was found.

Here are some example strings and their classification results:

"A bird saw a cat and a dog"     => cat
"A bird saw a dog"               => dog
"A bird saw a hot dog"           => hotdog
"A bird saw a hot nice dog"      => special hot dog
"A bird saw a hotdog"            => hotdog
"A bird saw nothing"             => bird
"A bird saw a hotdog and a cat"  => cat
"A bird saw a hot and nice dog"  => dog
"A girl saw nothing"             => unknown


 TIPS

While it may be tempting to do so, you do not have to put every pattern into a
single .dtree file. Breaking up patterns into their own logical .dtree files has
several advantages and avoids unneeded complexity. For example, if trying to
detect people, places, and things, put all three of these categories into their
own dtree index.

Also take advantage of the schema-free data modeling. If a pattern has a low
confidence for accuracy, add a key value to it saying so. When using multiple
indexes, this low confidence pattern will be exposed. Patterns can have values
which signal across indexes like triggering a search into a certain domain of
indexes. Essentially, you can build your own classification language centered
around your data and indexes.