
Context words are used outside the suffix/prefix window #1444

Open
omri374 opened this issue Aug 26, 2024 · 5 comments
Labels: Advanced, bug, good first issue

Comments


omri374 commented Aug 26, 2024

I'm new to Presidio (started working with the code yesterday), but I can't figure out why I'm getting the results I am. Code is below. It doesn't seem to be recognizing "cents" in the context. However, if I change it to 'cent', everything works fine. But that brings up another question: if it's basing the suffix count on "dollars", why is 'Six' (in 'Sixty') tagged? I assume I'm misunderstanding something. Any help would be appreciated.

from presidio_analyzer import (
    AnalyzerEngine,
    Pattern,
    PatternRecognizer,
)
from presidio_analyzer.recognizer_registry import RecognizerRegistry
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer

text = "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents?"

regex = r"(zero|one|two|three|four|five|six|seven|eight|nine)"
currency_pattern = Pattern(name="currency_pattern (strong)", regex=regex, score=0.01)

currency_recognizer_with_context = PatternRecognizer(
    supported_entity='CURRENCY',
    patterns=[currency_pattern],
    context=[
        'dollars',
        'cents',
    ]
)

context_aware_enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=1, 
    min_score_with_context_similarity=1,
    context_prefix_count=0,
    context_suffix_count=6,
)

registry = RecognizerRegistry()
registry.add_recognizer(currency_recognizer_with_context)
analyzer = AnalyzerEngine(registry=registry, context_aware_enhancer=context_aware_enhancer)

res = analyzer.analyze(text=text, language='en')
print(res)

Output:
[type: CURRENCY, start: 41, end: 45, score: 1, type: CURRENCY, start: 61, end: 65, score: 1, type: CURRENCY, start: 78, end: 81, score: 1, type: CURRENCY, start: 84, end: 89, score: 0.01]

Originally posted by @mmoody-vv in #1443
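
As an aside on the "why is 'Six' (in Sixty) tagged" question: that part comes from the regex itself rather than from the context logic. The pattern has no word boundaries, and the matches in the output above show it is applied case-insensitively, so "six" matches inside "Sixty". A minimal sketch of the difference (the \b anchors are a suggested change, not part of the original report):

import re

unbounded = r"(zero|one|two|three|four|five|six|seven|eight|nine)"
bounded = r"\b(zero|one|two|three|four|five|six|seven|eight|nine)\b"

text = "Sixty-Seven"
print(re.findall(unbounded, text, re.IGNORECASE))  # ['Six', 'Seven']
print(re.findall(bounded, text, re.IGNORECASE))    # ['Seven']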

omri374 added the bug, good first issue and Advanced labels Aug 26, 2024

omri374 commented Aug 26, 2024

This looks like a bug.

To reproduce:

res = analyzer.analyze(text=text, language='en', return_decision_process=True)

for ress in res:
    print(
        f"\ntext: {text[ress.start:ress.end]},"
        f"\nentity: {ress.entity_type}, "
        f"\nscore before: {ress.analysis_explanation.original_score}"
        f"\nscore context improvement: {ress.analysis_explanation.score_context_improvement}"
        f"\nsupporting context word: {ress.analysis_explanation.supportive_context_word}"
    )

Output:
text: Five,
entity: CURRENCY, 
score before: 0.01
score context improvement: 0.99
supporting context word: dollars

text: Nine,
entity: CURRENCY, 
score before: 0.01
score context improvement: 0.99
supporting context word: dollars

text: Six,
entity: CURRENCY, 
score before: 0.01
score context improvement: 0.99
supporting context word: dollars

text: Seven,
entity: CURRENCY, 
score before: 0.01
score context improvement: 0
supporting context word: 


hhobson commented Aug 27, 2024

Looks like this might be due to the model's part-of-speech tagging rather than a Presidio bug.

The above example uses the default spaCy NLP model, en_core_web_lg, with the text "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents".

  • When Dollars has a capital D, it is categorised as a Proper Noun and the lemma is "Dollars"
  • When dollars has a lowercase d, it is categorised as a Noun and the lemma is "dollar"
  • Due to the position of cents in the sentence, it is always categorised as a Noun with a lemma of "cent", whether or not the C is capitalised

This can be seen with the following code:

import spacy

nlp = spacy.load("en_core_web_lg")

texts = [
    "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents",
    "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven Cents",
    "Will you be paying the entire balance of Five Hundred Thirty-Nine dollars and Sixty-Seven cents",
    "Will you be paying the entire balance of Five Hundred Thirty-Nine dollars and Sixty-Seven Cents",
    "Will you be paying the entire balance of Sixty-Seven Cents and Five Hundred Thirty-Nine dollars",
]

for text in texts:
    print("\n", text)

    print(
        "Text:",  # Text: The original word text.
        "Lemma:",  # Lemma: The base form of the word.
        "POS:",  # POS: The simple universal part-of-speech tag.
        "Tag:",  # Tag: The detailed part-of-speech tag.
        "Alpha:",  # Alpha: Is the token an alpha character?
        "Stop:",  # Stop: Is the token part of a stop list, i.e. the most common words of the language?
        sep="\t"
    )
    for token in nlp(text):
        print(token.text, token.lemma_, token.pos_, token.tag_, token.is_alpha, token.is_stop, sep="\t")

Interestingly, if the en_core_web_sm model is used, then dollars is always categorised as a noun, so @mmoody-vv you could look at using that model instead.
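
For anyone wanting to try that, a minimal sketch of pointing Presidio at en_core_web_sm via NlpEngineProvider (assuming the model has already been downloaded with python -m spacy download en_core_web_sm):

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Load en_core_web_sm instead of the default en_core_web_lg.
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}
nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
analyzer = AnalyzerEngine(nlp_engine=nlp_engine)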

As the LemmaContextAwareEnhancer compares the context words to lemmas rather than to the actual words in the text, I think using the singular form of the words is best. I can't see this anywhere in the docs; happy to add it if this is correct, @omri374? So in this case, using "dollar" and "cent" should give you the behavior you're expecting, @mmoody-vv.
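
Applied to the recognizer from the original report (reusing the currency_pattern defined there), that would be:

currency_recognizer_with_context = PatternRecognizer(
    supported_entity='CURRENCY',
    patterns=[currency_pattern],
    context=[
        'dollar',  # singular forms, matching the lemmas spaCy produces
        'cent',
    ]
)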


omri374 commented Aug 28, 2024

@hhobson thanks for this analysis! I found it surprising that the lemma of "Dollars" is "Dollars"; that could be what's causing this issue. Based on your analysis, a fix would be to lowercase the token prior to lemmatizing it, but that's not straightforward: spaCy runs lemmatization and NER together, and we wouldn't want to pass in a lowercased sentence, as that would affect NER.
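
The tension is easy to see with the sentence from this issue: a lowercased copy produces the noun lemma, but it is a separate pipeline run whose NER output we wouldn't want to rely on:

import spacy

nlp = spacy.load("en_core_web_lg")
text = "Will you be paying the entire balance of Five Hundred Thirty-Nine Dollars and Sixty-Seven cents?"

# Original casing: "Dollars" is tagged PROPN and keeps the lemma "Dollars".
print([(t.text, t.pos_, t.lemma_) for t in nlp(text) if t.lower_ == "dollars"])

# Lowercased copy: "dollars" is tagged NOUN and lemmatizes to "dollar";
# but this run sees a different sentence, so its NER can't be trusted.
print([(t.text, t.pos_, t.lemma_) for t in nlp(text.lower()) if t.lower_ == "dollars"])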


hhobson commented Sep 1, 2024

I agree, lowercasing the text doesn't feel like the right thing to do, especially as the different sized spaCy models behaved differently in this case, so things might change in future versions.

I think the best approach is to recommend using singular-form context words, like dollar rather than dollars. When I tested this, it produced the expected behavior of boosting the score.


omri374 commented Sep 2, 2024

Would that solve the problem if the sentence has uppercase plurals to begin with? We would end up comparing "dollar" with "Dollars".
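
One workaround that sidesteps the casing question (a suggestion, not something verified against the enhancer internals) is to list both the singular lemma and the plural surface form. The output in the original report shows that the context word "dollars" did match the lemma "Dollars", so the comparison appears to be case-insensitive, and listing both forms would then cover both tagger outcomes:

currency_recognizer_with_context = PatternRecognizer(
    supported_entity='CURRENCY',
    patterns=[currency_pattern],
    # Cover both the NOUN lemma ("dollar") and the PROPN lemma ("Dollars").
    context=['dollar', 'dollars', 'cent', 'cents'],
)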
