Change the order in which patterns are run.
Previously, patterns were run in the order in which they are defined in
the pattern files, with general patterns run first, then script
patterns, then language patterns and finally country patterns. Change
this so that patterns with the same name are always run one after the
other in the position of the earliest one.

For example, consider the following patterns defined in the pattern
files in the following order.

    Latn.common-error:    A, B, C
    Latn-fi.common-error: B, D, E

Previously these were run in order A, B, C, B, D, E. With this change
they are run in order A, B, B, C, D, E or, if the policy of the B
pattern in Latn-fi has been set to 'Replace', in order A, B, C, D, E.
This should make it more convenient to replace or extend patterns
defined higher up in the hierarchy.
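
The reordering described above can be illustrated with a minimal Python sketch. The `Pattern` dataclass and the `order_patterns` and `run_order` functions are hypothetical stand-ins written for this example, not Gaupol's actual classes; only the name, origin and policy of each pattern are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Hypothetical stand-in for a pattern: name, origin file and policy."""
    name: str
    origin: str
    policy: str = "Append"  # implicit default; "Replace" must be explicit


def order_patterns(patterns):
    """Group same-named patterns at the position of the earliest one."""
    ordered = []
    for pattern in patterns:
        # Unless an earlier pattern shares the name, insert at the end.
        last_index = len(ordered) - 1
        for j, earlier in enumerate(ordered):
            if earlier is not None and earlier.name == pattern.name:
                last_index = j
                if pattern.policy == "Replace":
                    # Mark the less specific pattern for removal.
                    ordered[j] = None
        ordered.insert(last_index + 1, pattern)
        ordered = [p for p in ordered if p is not None]
    return ordered


def run_order(patterns):
    """Return "name (origin)" labels in the order patterns would run."""
    return [f"{p.name} ({p.origin})" for p in order_patterns(patterns)]


latn = [Pattern(name, "Latn") for name in "ABC"]
latn_fi = [Pattern(name, "Latn-fi") for name in "BDE"]
print(run_order(latn + latn_fi))
# ['A (Latn)', 'B (Latn)', 'B (Latn-fi)', 'C (Latn)', 'D (Latn-fi)', 'E (Latn-fi)']

latn_fi_replace = [Pattern("B", "Latn-fi", "Replace")] + latn_fi[1:]
print(run_order(latn + latn_fi_replace))
# ['A (Latn)', 'B (Latn-fi)', 'C (Latn)', 'D (Latn-fi)', 'E (Latn-fi)']
```

With the implicit 'Append' policy both B patterns run back to back; with 'Replace' the Latn-fi pattern takes over the slot of the Latn one.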

Adapt Latn-fr.common-error.in to the above order and add
Latn-fi.common-error.in to use the new order.
otsaloma committed May 10, 2009
1 parent f320a6d commit 119e18d
Showing 5 changed files with 42 additions and 12 deletions.
4 changes: 4 additions & 0 deletions data/patterns/Latn-fi.common-error.conf
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="utf-8"?>
+<patterns>
+<pattern name="Space after punctuation marks" enabled="true"/>
+</patterns>
11 changes: 11 additions & 0 deletions data/patterns/Latn-fi.common-error.in
@@ -0,0 +1,11 @@
+[Common Error Pattern]
+_Name=Space after punctuation marks
+_Description=Add space after various punctuation marks
+Classes=Human;OCR;
+# Same as the Latin pattern, except no space after colon to accommodate
+# Finnish suffixes, e.g. 'TV:ssä'.
+Pattern=((\w|^|["'«»])[,;?!])(?!["'«»])([^\W\d][\w\s])
+Flags=DOTALL;MULTILINE;UNICODE;
+Replacement=\1 \3
+Repeat=False
+Policy=Replace
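
The effect of dropping the colon from the Finnish pattern can be checked with Python's `re` module. This is only a sketch: the two patterns, the flags and the replacement are copied from the pattern files in this commit, while the `add_space` helper and the sample strings are assumptions made for illustration and are not part of Gaupol.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE | re.UNICODE

# Patterns as defined in Latn.common-error.in and Latn-fi.common-error.in.
LATN = r"""((\w|^|["'«»]) ?[,;:?!])(?!["'«»])([^\W\d][\w\s])"""
LATN_FI = r"""((\w|^|["'«»])[,;?!])(?!["'«»])([^\W\d][\w\s])"""


def add_space(pattern, text):
    """Apply a 'Space after punctuation marks' pattern once (Repeat=False)."""
    return re.sub(pattern, r"\1 \3", text, flags=FLAGS)


print(add_space(LATN, "TV:ssä"))          # TV: ssä
print(add_space(LATN_FI, "TV:ssä"))       # TV:ssä
print(add_space(LATN_FI, "Hei,maailma"))  # Hei, maailma
```

The Latin pattern splits the Finnish suffix from its stem, whereas the Finnish pattern leaves the colon alone and still adds a space after a comma.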
14 changes: 12 additions & 2 deletions data/patterns/Latn-fr.common-error.in
@@ -36,9 +36,19 @@ Repeat=False

[Common Error Pattern]
_Name=Space before punctuation marks
-_Description=Remove space before various punctuation marks
+_Description=Add or remove space before various punctuation marks
Classes=Human;OCR;
# Same as the Latin pattern, except keep space before [?!;:].
Pattern= +(["'«»]?(?!\.\.)([,.])(?!\d))
Flags=DOTALL;MULTILINE;UNICODE;
Replacement=\1
Repeat=False
Policy=Replace

[Common Error Pattern]
_Name=Space before punctuation marks
_Description=Add or remove space before various punctuation marks
Classes=Human;OCR;
# Readd space removed by the corresponding pattern in Latn.
Pattern=([^\s?!;:])([?!;:])(?!\d)
Flags=DOTALL;MULTILINE;UNICODE;
Replacement=\1 \2
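
How the two French patterns cooperate can be sketched in Python. The patterns, flags and replacements are taken from the hunk above; the `fix_french_spacing` helper and the sample strings are hypothetical, and the sketch assumes the two patterns are simply run one after the other, as the new ordering intends.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE | re.UNICODE

# (pattern, replacement) pairs from Latn-fr.common-error.in, in file order.
REMOVE_SPACE = (r""" +(["'«»]?(?!\.\.)([,.])(?!\d))""", r"\1")
READD_SPACE = (r"""([^\s?!;:])([?!;:])(?!\d)""", r"\1 \2")


def fix_french_spacing(text):
    """Run both 'Space before punctuation marks' patterns in sequence."""
    for pattern, replacement in (REMOVE_SPACE, READD_SPACE):
        text = re.sub(pattern, replacement, text, flags=FLAGS)
    return text


print(fix_french_spacing("Bonjour , ça va?"))  # Bonjour, ça va ?
print(fix_french_spacing("Quoi ?"))            # Quoi ?
print(fix_french_spacing("à 12:30"))           # à 12:30
```

The first pattern removes the space before a comma or full stop only, the second adds the space that French expects before `?`, `!`, `;` and `:`, and the `(?!\d)` lookahead keeps times such as 12:30 intact.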
5 changes: 3 additions & 2 deletions data/patterns/Latn.common-error.in
@@ -75,7 +75,7 @@ Repeat=False
 _Name=Space before punctuation marks
 _Description=Remove space before various punctuation marks
 Classes=Human;OCR;
-# NOTE: This goes against French spacing, Latn-fr can override.
+# NOTE: Unsuitable for French, see Latn-fr.
 # Disallow trailing digits for the case of fractions, e.g. '.45'.
 Pattern= +(["'«»]?(?!\.\.)([,;:.?!])(?!\d))
 Flags=DOTALL;MULTILINE;UNICODE;
@@ -86,7 +86,8 @@ Repeat=False
 _Name=Space after punctuation marks
 _Description=Add space after various punctuation marks
 Classes=Human;OCR;
-Pattern=((\w|^|["'«»])[,;:?!])(?!["'«»])([^\W\d][\w\s])
+# NOTE: Unsuitable for Finnish, see Latn-fi.
+Pattern=((\w|^|["'«»]) ?[,;:?!])(?!["'«»])([^\W\d][\w\s])
 Flags=DOTALL;MULTILINE;UNICODE;
 Replacement=\1 \3
 Repeat=False
20 changes: 12 additions & 8 deletions gaupol/patternman.py
@@ -70,18 +70,22 @@ def _filter_patterns(self, patterns):
         Patterns with a more specific code replace those with a less specific
         code if they have the same name and the more specific pattern's policy
         is explicitly set to 'Replace' (instead of the implicit 'Append').
+        Maintain order so that all patterns with the same name are always
+        located in the position of the earliest of such patterns.
         """
         filtered_patterns = []
         for i, pattern in enumerate(patterns):
-            replacement_found = False
             name = pattern.get_name(False)
-            for j in range(i + 1, len(patterns)):
-                j_name = patterns[j].get_name(False)
-                j_policy = patterns[j].get_field("Policy")
-                if (j_name == name) and (j_policy == "Replace"):
-                    replacement_found = True
-            if not replacement_found:
-                filtered_patterns.append(pattern)
+            policy = pattern.get_field("Policy")
+            last_index = len(filtered_patterns) - 1
+            for j, filtered_pattern in enumerate(filtered_patterns):
+                if filtered_pattern.get_name() == name:
+                    last_index = j
+                    if policy == "Replace":
+                        filtered_patterns[j] = None
+            filtered_patterns.insert(last_index + 1, pattern)
+            while None in filtered_patterns:
+                filtered_patterns.remove(None)
         return filtered_patterns
 
     def _get_codes_require(self, script=None, language=None, country=None):