Inconsistent annotations for LS numbers #464

Closed
rhdunn opened this issue Oct 28, 2023 · 2 comments

Comments

@rhdunn
Contributor

rhdunn commented Oct 28, 2023

Validation issues:

ERROR: Sentence answers-20111108024148AAO8oFI_ans-0010 token 12 -- invalid X form '1'
ERROR: Sentence email-enronsent24_01-0014 token 5 -- invalid X form '20'
ERROR: Sentence email-enronsent24_01-0057 token 4 -- invalid X form '20'
ERROR: Sentence email-enronsent24_01-0114 token 4 -- invalid X form '20'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0007 token 1 -- invalid X form '1'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0011 token 1 -- invalid X form '2'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0017 token 1 -- invalid X form '3'
ERROR: Sentence answers-20111108090913AAf83Jh_ans-0021 token 1 -- invalid X form '4'
ERROR: Sentence answers-20111108073322AA27tkh_ans-0012 token 2 -- invalid X form '2'
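
To see how these errors cluster, the validator output can be tallied per source document. The following is a minimal sketch that assumes the message format quoted above. The names `parse_errors` and `count_by_document` are hypothetical helpers and not part of the UD validation tooling.

```python
import re
from collections import Counter

# Pattern inferred from the messages quoted in this issue (an assumption);
# other validator messages will not match and are skipped.
ERROR_RE = re.compile(
    r"ERROR: Sentence (?P<sent_id>\S+) token (?P<token>\d+) "
    r"-- invalid X form '(?P<form>[^']*)'"
)


def parse_errors(log_text):
    """Return (sent_id, token_index, form) tuples for each matching line."""
    results = []
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            results.append(
                (match["sent_id"], int(match["token"]), match["form"])
            )
    return results


def count_by_document(errors):
    """Count errors per document.

    Assumes the document ID is the sentence ID minus its trailing
    "-NNNN" sentence counter.
    """
    return Counter(sent_id.rsplit("-", 1)[0] for sent_id, _, _ in errors)
```

Run over the nine lines above, this would group the errors into four source documents, which makes it easier to check each list in context.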

There are several issues here:

  1. These should be NUM instead of X to be consistent with the other LS annotations.
  2. They should be attached to the following sentence to be consistent with how the other LS+NUM tokens are grouped.
  3. The LS tokens are missing NumType=Ord|NumForm=Digit features -- there may be other cases like this.

Note: I'm using NumType=Ord here instead of Card as these are ordered values -- first, second, third, etc. -- not counted values.
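
To find the "other cases like this", points 1 and 3 can be checked mechanically over the CoNLL-U files. The following is a rough sketch, not an existing tool: `check_ls_tokens` and `EXPECTED_FEATS` are hypothetical names, the expected features are the ones proposed in this issue, and the check is limited to LS tokens whose form is all ASCII digits.

```python
import re

# Features proposed in this issue for digit list markers (an assumption,
# not a confirmed guideline).
EXPECTED_FEATS = {"NumType": "Ord", "NumForm": "Digit"}


def parse_feats(feats):
    """Turn a FEATS string such as 'NumForm=Digit|NumType=Ord' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_ls_tokens(conllu_text):
    """Return (sent_id, token_id, form, problem) for inconsistent LS tokens."""
    problems = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
            continue
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, upos, xpos = cols[0], cols[1], cols[3], cols[4]
        # Only digit-only list markers are checked here.
        if xpos != "LS" or not re.fullmatch(r"[0-9]+", form):
            continue
        if upos != "NUM":
            problems.append(
                (sent_id, tok_id, form, f"UPOS is {upos}, expected NUM")
            )
        feats = parse_feats(cols[5])
        for name, value in EXPECTED_FEATS.items():
            if feats.get(name) != value:
                problems.append(
                    (sent_id, tok_id, form, f"missing {name}={value}")
                )
    return problems
```

Point 2 (sentence attachment) is not covered, since it depends on sentence segmentation rather than on per-token annotation.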

@rhdunn
Contributor Author

rhdunn commented Oct 28, 2023

Looking across the different treebanks, EWT splits list markers such as (1) or i) into separate tokens, whereas GUM and GENTLE keep each marker as a single token.

GUM and GENTLE also keep multi-level list markers, such as 2.1., as a single token. I don't think EWT has any examples of that in its data set.

@nschneid
Contributor

nschneid commented Oct 28, 2023

> These should be NUM instead of X to be consistent with the other LS annotations.

Thanks. A Grew-match query for these:

See also #440

> They are also keeping multi-section list items grouped, such as in 2.1.. I don't think EWT has examples of that in its data set.

See email-enronsent38_01-0002 and successive sentences. They are kept as one token.

> They should be attached to the following sentence to be consistent with how the other LS+NUM tokens are grouped.

Perhaps, but I'm guessing they were separated in the original text with newlines or something. Messing with the sentence boundaries is something I'm a little reluctant to do... let's move that discussion to #415.

> The LS tokens are missing NumType=Ord|NumForm=Digit features -- there may be other cases like this.

Will open a separate issue for this.
