
Enhancement: Distinguish between error message and context in validation of spans #2329

Conversation

tomaarsen
Contributor

Description

I slightly expanded the error message raised when the provided spans do not match the tokenization.
Consider the following example script:

import argilla as rg

record = rg.TokenClassificationRecord(
    text="This is my text",
    tokens=["This", "is", "my", "text"],
    prediction=[("ORG", 0, 6), ("PER", 8, 14)],
)

The (truncated) output on the develop branch:

ValueError: Following entity spans are not aligned with provided tokenization
Spans:
- [This i] defined in This is my ...
- [my tex] defined in ...s is my text
Tokens:
['This', 'is', 'my', 'text']
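
For context, these spans fail validation because their character offsets do not land on token boundaries. A minimal sketch of such an alignment check (illustrative only; argilla's actual logic lives in src/argilla/utils/span_utils.py, and `is_aligned` is a hypothetical name):

```python
text = "This is my text"
tokens = ["This", "is", "my", "text"]

# Collect the character offsets where tokens start and end
# (assumes each token occurs, in order, within the text).
boundaries = set()
pos = 0
for token in tokens:
    start = text.index(token, pos)
    end = start + len(token)
    boundaries.update((start, end))
    pos = end

def is_aligned(span):
    """A span aligns if both of its offsets fall on token boundaries."""
    _label, start, end = span
    return start in boundaries and end in boundaries

print(is_aligned(("ORG", 0, 6)))   # False: offset 6 falls inside "is" (5..7)
print(is_aligned(("PER", 8, 14)))  # False: offset 14 falls inside "text" (11..15)
```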

The boundary between the literal "defined in" and the snippet "This is my ..." is unclear. I've worked on this.

The (truncated) output after this PR:

ValueError: Following entity spans are not aligned with provided tokenization
Spans:
- [This i] defined in 'This is my ...'
- [my tex] defined in '...s is my text'
Tokens:
['This', 'is', 'my', 'text']

Note the additional quotes ('). The changes rely on repr, so if the snippet itself contains ', double quotes (") are used instead, e.g.:

import argilla as rg

record = rg.TokenClassificationRecord(
    text="This is Tom's text",
    tokens=["This", "is", "Tom", "'s", "text"],
    prediction=[("ORG", 0, 6), ("PER", 8, 16)],
)
ValueError: Following entity spans are not aligned with provided tokenization
Spans:
- [This i] defined in 'This is Tom...'
- [Tom's te] defined in "...s is Tom's text"
Tokens:
['This', 'is', 'Tom', "'s", 'text']
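
The quote switching comes from Python's built-in repr, which prefers single quotes but falls back to double quotes when the string contains a single quote (and no double quote):

```python
# repr() picks the quote character automatically: single quotes by
# default, double quotes when the string itself contains a single quote.
print(repr("This is my ..."))       # 'This is my ...'
print(repr("...s is Tom's text"))   # "...s is Tom's text"
```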

Type of change

  • Improvement (change adding some improvement to an existing functionality)

How Has This Been Tested

Modified the relevant tests and ensured they pass.

Checklist

  • I have merged the original branch into my forked branch

  • I added relevant documentation

  • My code follows the style guidelines of this project

  • I did a self-review of my code

  • I added comments to my code

  • I made corresponding changes to the documentation

  • My changes generate no new warnings

  • I have added tests that prove my fix is effective or that my feature works


@tomaarsen added the "type: enhancement" label (indicates new feature requests) on Feb 9, 2023
@codecov

codecov bot commented Feb 9, 2023

Codecov Report

Base: 92.55% // Head: 91.83% // Decreases project coverage by 0.72% ⚠️

Coverage data is based on head (ceae0a6) compared to base (d354274).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head ceae0a6 differs from pull request most recent head bbe5d85. Consider uploading reports for the commit bbe5d85 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2329      +/-   ##
===========================================
- Coverage    92.55%   91.83%   -0.72%     
===========================================
  Files          159      161       +2     
  Lines         7840     7897      +57     
===========================================
- Hits          7256     7252       -4     
- Misses         584      645      +61     
Flag Coverage Δ
pytest 91.83% <100.00%> (-0.72%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/argilla/utils/span_utils.py 100.00% <100.00%> (ø)
src/argilla/client/sdk/text_classification/api.py 87.09% <0.00%> (-12.91%) ⬇️
src/argilla/client/client.py 87.50% <0.00%> (-6.49%) ⬇️
...gilla/labeling/text_classification/label_errors.py 86.58% <0.00%> (-3.66%) ⬇️
src/argilla/client/datasets.py 74.62% <0.00%> (-2.34%) ⬇️
...la/server/apis/v0/handlers/token_classification.py 80.00% <0.00%> (-0.29%) ⬇️
src/argilla/server/server.py 84.09% <0.00%> (-0.18%) ⬇️
.../argilla/server/services/tasks/text2text/models.py 97.22% <0.00%> (-0.15%) ⬇️
...argilla/server/services/tasks/text2text/service.py 96.66% <0.00%> (-0.11%) ⬇️
src/argilla/monitoring/base.py 90.40% <0.00%> (-0.08%) ⬇️
... and 13 more


@davidberenstein1957
Member

davidberenstein1957 commented Feb 10, 2023

Hi @tomaarsen

Looks much more readable! Perhaps it makes sense to link them directly to the provided input. Also, I think the substring part ("This is Tom...") could be a bit longer for reference.

("ORG", 0, 6) - 'This i' defined in 'This is Tom...'

@tomaarsen
Contributor Author

Even better. I also considered output along the lines of this:

("ORG", 0, 6): 'This is Tom...'
                ^^^^^^

But I suspect that we would be best off implementing #1843 instead and using colors.
That said, we may want to have a "color toggle" of sorts for users directly logging away into files, in which case this solution would work well.

Thoughts?

I'm down to update this PR according to your suggestion, by the way.

@davidberenstein1957
Member

davidberenstein1957 commented Feb 10, 2023

Yes, I agree that something like that would be even better, but it might require a bit more custom code/logic. For now, I would go for a quick win and continue fine-tuning from there once the initial rich improvements land.

So, if it doesn't take up too much time, the last proposed design seems best to me.

@tomaarsen
Contributor Author

I've updated the error to match the proposed message from #2329 (comment). I think we should skip the proposed error from #2329 (comment).

import argilla as rg

record = rg.TokenClassificationRecord(
    text="This is Tom's text",
    tokens=["This", "is", "Tom", "'s", "text"],
    prediction=[("ORG", 0, 6), ("PER", 8, 16)],
)
ValueError: Following entity spans are not aligned with provided tokenization
Spans:
('ORG', 0, 6) - [This i] defined in 'This is Tom...'
('PER', 8, 16) - [Tom's te] defined in "...s is Tom's text"
Tokens:
['This', 'is', 'Tom', "'s", 'text']

I think this PR is ready. The next step could come when #1843 is implemented.


@davidberenstein1957
Member

davidberenstein1957 commented Feb 15, 2023

Hi @tomaarsen ,

Sorry, readability is still a bit off for me. I think the "defined in" phrasing makes it a bit unclear.
I think I prefer the second option below.

Option 1:

Spans:
("ORG", 0, 6, "This i")
("PER", 8, 16, "Tom's te")
Tokens:
['This', 'is', 'Tom', "'s", 'text']

Option 2:

Spans:
("ORG", 0, 6) - "This i"
("PER", 8, 16) - "Tom's te"
Tokens:
['This', 'is', 'Tom', "'s", 'text']

Option 3:

Spans:
("ORG", 0, 6, "This i") - "This is Tom..."
("PER", 8, 16, "Tom's te") - "...s is Tom's text"
Tokens:
['This', 'is', 'Tom', "'s", 'text']

@tomaarsen
Contributor Author

No worries, I'm open to iterating here.
I think options 1 and 3 from your proposal are unintuitive: the user strictly provides ("ORG", 0, 6) as part of the prediction, and those proposals arbitrarily add e.g. "This i" to the tuple. I like option 2, but we could also make it wordier:

These spans do not correspond with token boundaries:
("ORG", 0, 6) corresponds with "This i" in "This is Tom..."
("PER", 8, 16) corresponds with "Tom's te" in "...s is Tom's text"
Tokens:
['This', 'is', 'Tom', "'s", 'text']

I'm curious to hear your thoughts.

@davidberenstein1957
Member

I wanted to avoid wordiness because we are dealing with a lot of text/words. If you feel option 2 is too unintuitive, I would prefer your final proposal.

@tomaarsen
Contributor Author

No, I think option 2 is fairly clear, especially as we must consider that the snippet follows "Following entity spans are not aligned with provided tokenization". I'll get on it :)

@tomaarsen
Contributor Author

Updated the output:

import argilla as rg

record = rg.TokenClassificationRecord(
    text="This is Tom's text",
    tokens=["This", "is", "Tom", "'s", "text"],
    prediction=[("ORG", 0, 6), ("PER", 8, 16)],
)
ValueError: Following entity spans are not aligned with provided tokenization
Spans:
('ORG', 0, 6) - 'This i'
('PER', 8, 16) - "Tom's te"
Tokens:
['This', 'is', 'Tom', "'s", 'text']
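
As a rough sketch, the merged format above could be assembled like this (illustrative only; the real implementation lives in src/argilla/utils/span_utils.py, and `format_misaligned` is a hypothetical name):

```python
def format_misaligned(spans, text, tokens):
    """Build the error message for spans that miss token boundaries."""
    lines = [
        "Following entity spans are not aligned with provided tokenization",
        "Spans:",
    ]
    for label, start, end in spans:
        # !r delegates to repr, which picks ' or " based on the snippet.
        lines.append(f"{(label, start, end)} - {text[start:end]!r}")
    lines.append("Tokens:")
    lines.append(repr(tokens))
    return "\n".join(lines)

print(format_misaligned(
    spans=[("ORG", 0, 6), ("PER", 8, 16)],
    text="This is Tom's text",
    tokens=["This", "is", "Tom", "'s", "text"],
))
```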

I'm open to any additional feedback.

@davidberenstein1957 davidberenstein1957 merged commit 179ffb9 into argilla-io:develop Feb 15, 2023
@tomaarsen tomaarsen deleted the enhancement/improve_span_validation_error branch February 15, 2023 15:21