-
Notifications
You must be signed in to change notification settings - Fork 17
/
README
203 lines (144 loc) · 7.56 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
dClass pattern format and file structure
COMMENTS
Comments in the .dtree file are prefixed with '#'. The comment prefix must be
the first character in the line. The first non directive comment in the .dtree
file is used as the index comment and will be stored within the index. Example:
#This is a comment
COMMENT DIRECTIVES
The dClass index supports several pattern search modes. Modes are prefixed as a
comment directive using the following characters '#$'. dClass supports the
following directive search modes:
#$regex
This allows for regular expression pattern matching.
#$dups
This allows for a single pattern to have multiple pattern definitions.
#$partial
This allows for partial pattern matches in non regex searches.
#$notrim
This does not trim extra whitespace around pattern definitions.
#@error_id
This sets the id for the kv returned on error.
The dClass index supports string caching. Cached strings are only allocated
once and re-referenced there after. The order of strings in this directive (and
the order of pattern definitions) are used to break ties in patterns with the
same rank. The size of the cache is set using DTREE_M_LOOKUP_CACHE. Also, the
first string allocated will be returned as the error id with no key values if
the error id directive is not used.
#!unknown,string1,groupa,string ab3
PATTERNS
All patterns are defined in the following format:
pattern ; id ; type ; parent ; key1=value , key2=value2
Note that semicolons are used to separate the pattern attributes and commas are
used to separate the key value pairs. By default whitespace is trimmed. There
are two ways to include leading and trailing whitespace. First, you can wrap
your attribute in double quotes. Second, you can set the notrim directive.
pattern
This is the pattern to match. Patterns must begin and end on a token (non
separator) character boundary (DTREE_HASH_TCHARS or less than DTREE_HASH_SEP).
Patterns may contain separator characters. If the regex directive is used, the
following regex is supported:
. - any character
? - the preceding character or group is optional
^ - the beginning of the text to classify
$ - the end of the text to classify
\ - escape, treat the next character as non regex
[] - a set of characters, only 1 of the set needs to match
() - a group of characters, the whole group needs to match
The longest pattern is always matched. Token characters are always matched
before regex characters.
id
Unique identifier for the pattern. Patterns with matching ids will share key
values (the first defined key value set will be used). If referenced as a
parent (or defined as a BASE), only one of of the patterns in the id set needs
to match.
type
The following types are supported: S,B,C,W,N.
S - STRONG pattern. If found, classification is stopped the pattern is returned
immediately regardless of other patterns found or to be found.
B - BASE pattern. This pattern is only used as a parent dependency. May itself
have a parent dependency.
C - CHAIN pattern. If all parent dependent BASE patterns have been found then
this pattern is complete. If no rank has been assigned, classification is
stopped and the pattern is returned immediately. Otherwise it is returned
at the end of classification if its the highest ranking CHAIN pattern found
in the absence of a higher ranking WEAK pattern.
W - WEAK pattern. If found, this pattern is returned at the end of
classification if its the highest ranking WEAK pattern found in the
absence of a higher ranking CHAIN pattern.
N - NONE pattern. If found, this pattern is returned at the end of
classification in the absence of CHAIN or WEAK patterns.
Some types can take an optional rank parameter. All types can take an optional
position parameter. Example:
type[RANK]:[POS]
CHAIN and WEAK patterns can take an optional rank parameter. Ranks can range
from 1 to 255. The highest ranking CHAIN or WEAK pattern is returned. If a
CHAIN pattern has no rank, its returned immediately without analyzing the rest
of the string. If both a CHAIN and WEAK pattern have the same rank, the CHAIN
pattern is returned.
All patterns can take an optional position parameter. Positions can range from
1 to 254. A position of 1 is equivalent to the '^' regex character. The pattern
must start at the given word position in the classified text (using the defined
separator chars) to be valid.
Example:
B:2 - the BASE pattern must start at the second word
W50:5 - the WEAK pattern must start at the fifth word and has a rank of 50
parent
id of the BASE pattern which is dependent on this pattern. Only BASE or CHAIN
patterns can have parents. The parent parameter can take an optional direction
parameter which is the maximum word distance the parent can be from the given
pattern. Example:
parent:[DIR]
Directions can range from -127 to 127. If a parent has no direction or a
direction of 0, the parent can be any distance before the given pattern.
Negative directions proceed the given pattern and positive directions follow
the given pattern (ie: -127 to 127). Positive direction parents cannot itself
have any parent dependencies.
key values
Key value pairs which are returned with the pattern when matched. A uniform set
of keys gives the best runtime performance since the key set is reused across
all patterns (an empty keyset does not violate this).
EXAMPLES
The following is an example .dtree with comments:
#hotdog example
#enable regex, dups, and set unknown as the error id
#$regex
#$dups
#!unknown
#pattern ;id ;type ;parent ;key=values
cat ;cat ;S ; ;return=cat
dog ;dog ;W10 ; ;return=dog
hot ?dog ;hotdog ;W25 ; ;return=hotdog
hot ;hot ;B ; ;
dog ;hotdogc ;C40 ;hot:-2 ;return=special hot dog
bird ;bird ;N ; ;return=bird
#end example
The first cat pattern is STRONG meaning when the pattern is found, it is
returned immediately. The next dog pattern is a WEAK pattern of rank 10. The
hotdog pattern after that has an optional space and it has a WEAK rank of 25.
Next is a base pattern with the word hot. This pattern is referenced by a
duplicate dog pattern and the parent (hot) must be within 2 words of this
pattern. This pattern has the highest rank of 40, but still ranks below the
STRONG pattern. Finally, there is a bird pattern which has a NONE type which
means its only returned when found and no other pattern was found.
Here are some example strings and their classification results:
"A bird saw a cat and a dog" => cat
"A bird saw a dog" => dog
"A bird saw a hot dog" => hotdog
"A bird saw a hot nice dog" => special hot dog
"A bird saw a hotdog" => hotdog
"A bird saw nothing" => bird
"A bird saw a hotdog and a cat" => cat
"A bird saw a hot and nice dog" => dog
"A girl saw nothing" => unknown
TIPS
While it may be tempting to do so, you do not have to put every pattern into a
single .dtree file. Breaking up patterns into their own logical .dtree files has
several advantages and avoids unneeded complexity. For example, if trying to
detect people, places, and things, put all three of these categories into their
own dtree index.
Also take advantage of the schema free data modeling. If a pattern has a low
confidence for accuracy, add a key value to it saying so. When using multiple
indexes, this low confidence pattern will be exposed. Patterns can have values
which signal across indexes like triggering a search into a certain domain of
indexes. Essentially, you can build your own classification language centered
around your data and indexes.