This repository has been archived by the owner on Mar 25, 2024. It is now read-only.
Summary
I implemented a new pattern matching library, based on the insights from my bachelor thesis.
It can be used to match phrases, words and math formulae.
At the moment, the patterns are written as an XML file.
To test the new library, I implemented a new declaration spotter.
The PR also adds missing support for XML namespaces to the serialization code.
The Pattern Language
The patterns are written in an XML file (as in this example).
A pattern file essentially contains a list of rules, which can reference each other.
I will try to provide an overview of what these rules look like. One day I might write proper documentation.
Here is an example rule:
This creates a rule for matching words. It has a name so that we can reference it later. The `meta` node is optional and currently does not support much metadata.
Afterwards, we have the actual pattern that is matched by this rule. In this case, it is a `word_or` pattern, which matches a word if any of the contained word patterns matches.
Here is a second word rule, referencing this rule:
There exist the following types of rules:
- `mtext_rule` for matching the symbols in `math` nodes
- `math_rule` for matching `math` nodes (or parts of them)
- `pos_rule` for matching part-of-speech (POS) tags
- `word_rule` for matching words
- `seq_rule` for matching sequences of words

Here is a more advanced example of two math rules that match an identifier using mutual recursion:
For consistency, every pattern starts with a prefix, denoting what it matches. The only exception is the `phrase` pattern. It obviously matches sequences of words. Here is another example pattern that illustrates how the `phrase` pattern can be used and how patterns of different types can be combined:
Markers
Now we can use these rules to find e.g. declarations in a document. However, we'd also be interested in identifying the components of this declaration (introduced identifier, restrictions, ...).
For this purpose, we can add markers to our patterns.
Here is a rule that matches and marks simple formulas that introduce and restrict identifiers like in $a \in M$ or $x \ge 0$:
A marker has a name and optionally a list of tags associated with it. Markers can also be added to words and sequences of words. However, they are processed differently internally, as they correspond to ranges in the DNM, while math markers correspond to nodes in the DOM.
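To make the marker idea more concrete, here is a small Python sketch of how marked components of a formula like $a \in M$ could be represented as nested nodes. It is only an illustration and not the library's data model: the class `MarkerNode`, the field `covered` (a plain-string stand-in for a DNM range or a DOM node) and the marker names `declaration`, `identifier` and `restriction` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class MarkerNode:
    """Hypothetical stand-in for one matched marker (not the library's type)."""
    name: str
    tags: List[str] = field(default_factory=list)  # tags are optional
    covered: str = ""  # stand-in for the DNM range or DOM node that was marked
    children: List["MarkerNode"] = field(default_factory=list)

    def walk(self) -> Iterator["MarkerNode"]:
        """Yield this marker and all nested markers, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


def find_markers(root: MarkerNode, name: str) -> List[MarkerNode]:
    """Collect every marker with the given name below (and including) root."""
    return [node for node in root.walk() if node.name == name]


# Assumed marker names for the formula "a \in M"; the real pattern file may differ.
declaration = MarkerNode(
    "declaration",
    tags=["math"],
    covered=r"a \in M",
    children=[
        MarkerNode("identifier", covered="a"),
        MarkerNode("restriction", tags=["set-membership"], covered=r"\in M"),
    ],
)
```

With such a structure, `find_markers(declaration, "identifier")` would return the node covering `a`, which is roughly what a declaration spotter would need to extract the introduced identifier.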
Currently, the only way to use the rules is by calling a `match_sentence` function, which takes a sentence and a `seq_rule` name and returns a list of all matches in that sentence.
A match contains the matched markers as a tree structure.
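The following toy model in Python illustrates the general idea of named rules that reference each other and of a function that returns all matches in a sentence. It is not the library's implementation or API: only the name `match_sentence` and the `word_or` pattern come from the description above, while the tuple encoding of patterns, the `word` and `word_ref` kinds, the rule names and the flat marker list are assumptions made for brevity (the real matches carry their markers as a tree).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding of word patterns as (kind, argument) tuples:
#   ("word", "the")        matches exactly that word (case-insensitive)
#   ("word_or", [p1, p2])  matches if any contained word pattern matches
#   ("word_ref", "name")   delegates to the word rule with that name
WORD_RULES: Dict[str, tuple] = {
    "quantifier": ("word_or", [("word", "every"), ("word", "any"), ("word", "some")]),
    "determiner": ("word_or", [("word", "a"), ("word", "the"), ("word_ref", "quantifier")]),
}

# Toy sequence rules: one (optional marker name, word pattern) pair per word.
SEQ_RULES: Dict[str, List[Tuple[Optional[str], tuple]]] = {
    "det_noun": [
        ("det", ("word_ref", "determiner")),
        ("noun", ("word_or", [("word", "set"), ("word", "group")])),
    ],
}


def match_word(pattern: tuple, word: str) -> bool:
    """Check a single word against a toy word pattern."""
    kind, arg = pattern
    if kind == "word":
        return word.lower() == arg
    if kind == "word_or":
        return any(match_word(p, word) for p in arg)
    if kind == "word_ref":
        return match_word(WORD_RULES[arg], word)
    raise ValueError(f"unknown pattern kind: {kind}")


@dataclass
class Match:
    start: int  # index of the first matched word
    end: int    # index one past the last matched word
    markers: List[Tuple[str, int, int]]  # flat (name, start, end) triples


def match_sentence(sentence: List[str], seq_rule_name: str) -> List[Match]:
    """Return every position at which the named sequence rule matches."""
    rule = SEQ_RULES[seq_rule_name]
    matches = []
    for start in range(len(sentence) - len(rule) + 1):
        window = sentence[start:start + len(rule)]
        if all(match_word(p, w) for (_, p), w in zip(rule, window)):
            markers = [(name, start + i, start + i + 1)
                       for i, (name, _) in enumerate(rule) if name is not None]
            matches.append(Match(start, start + len(rule), markers))
    return matches
```

Calling `match_sentence("Let G be a group and every set".split(), "det_noun")` in this sketch yields two matches, one for "a group" and one for "every set", the latter reaching the `quantifier` rule through the reference in `determiner`.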
Insights From The Example Declaration Spotter
Using this pattern file, I created a small example spotter to test the pattern matching library.
As KAT doesn't support string offsets yet, I simply exported the results into an HTML file (attached as ZIP, because GitHub didn't let me attach HTML). For simplicity, I ignored the tree structure of the resulting matches.
Insights: