
Several sentences just have a '?' or other single character. #415

Open
rhdunn opened this issue Aug 19, 2023 · 6 comments
Labels
orthography spelling, punctuation, tokenization

Comments

@rhdunn
Contributor

rhdunn commented Aug 19, 2023

The following sentences all have text = ?:

  1. email-enronsent09_02-0034
  2. email-enronsent09_02-0036
  3. email-enronsent22_01-0067
  4. email-enronsent26_02-0021
  5. email-enronsent26_02-0023
  6. email-enronsent26_02-0025
  7. email-enronsent26_02-0028 -- This one looks like it may be due to invalid encoding when processing U+A0 (NBSP, no-break space) from the preceding sentence.
  8. email-enronsent32_01-p0009
  9. email-enronsent33_01-0130
  10. email-enronsent33_01-0132
  11. email-enronsent33_01-0145

There are also several other sentences where the text is just a single character. Many of these lack newpar or similar annotations that would ensure they don't combine with and interfere with the surrounding sentences:

  1. answers-20111108044917AALAHtc_ans-0005 -- m
  2. answers-20111108084416AAoPgBv_ans-0010 -- %
  3. answers-20111108090913AAf83Jh_ans-p0004 -- 1 (list separator)
  4. answers-20111108090913AAf83Jh_ans-0011 -- 2 (list separator)
  5. answers-20111108090913AAf83Jh_ans-0017 -- 3 (list separator)
  6. answers-20111108090913AAf83Jh_ans-0021 -- 4 (list separator)
  7. email-enronsent27_01-0049 -- m
  8. email-enronsent27_01-0064 -- m
  9. newsgroup-groups.google.com_alt.animals_1054ad831ec01b4c_ENG_20031204_144900-0002 -- s
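Fragments like these can be found mechanically. A minimal sketch in Python (hypothetical, not part of any existing tool; it assumes only the standard CoNLL-U comment conventions `# sent_id`, `# text`, and `# newpar`, with sentences separated by blank lines):

```python
def find_single_char_sentences(conllu_lines):
    """Yield (sent_id, text, has_newpar) for every sentence whose text
    is a single character, so missing newpar annotations stand out."""
    sent_id, text, has_newpar = None, None, False
    for line in conllu_lines:
        line = line.rstrip("\n")
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
        elif line.startswith("# text = "):
            text = line[len("# text = "):]
        elif line.startswith("# newpar"):
            has_newpar = True
        elif line == "":  # blank line terminates the sentence
            if text is not None and len(text) == 1:
                yield sent_id, text, has_newpar
            sent_id, text, has_newpar = None, None, False
```

Running this over the EWT .conllu files would reproduce the lists above, with the `has_newpar` flag separating the problematic cases from the acceptable ones.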

The following look ok as they have newpar annotations on the sentence and the following sentence:

  1. email-enronsent09_02-p0006 -- D
  2. email-enronsent09_02-p0015 -- D
  3. email-enronsent21_01-p0007 -- M
  4. newsgroup-groups.google.com_HumorUniversity_00dd93cc9545deb3_ENG_20051130_122700-p0001 -- *

I don't know the best way to handle/mark up these so they can be processed consistently (e.g. by tools that convert the CoNLL-U files to text), as there seem to be different cases here.
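To illustrate why the markup matters for the conversion tools mentioned above: a minimal sketch (hypothetical, assuming only the `# text` and `# newpar` comment conventions) of a CoNLL-U-to-text converter that relies on newpar to decide where paragraph breaks go:

```python
def conllu_to_text(conllu_lines):
    """Rebuild plain text from CoNLL-U comments: sentences within a
    paragraph are joined by spaces; newpar starts a new paragraph."""
    paragraphs, current = [], []
    for line in conllu_lines:
        line = line.rstrip("\n")
        if line.startswith("# newpar"):
            if current:
                paragraphs.append(" ".join(current))
                current = []
        elif line.startswith("# text = "):
            current.append(line[len("# text = "):])
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Without newpar on the single-character sentences, a converter like this glues them onto their neighbours, which is exactly the inconsistency described above.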

@nschneid
Contributor

Thanks for pointing these out—my sense is that email/forum data is just going to be messy sometimes, and the notion of a paragraph or sentence unit isn't always clear. I wasn't involved in the original data preprocessing, but if you have access to the LDC release that might be informative regarding these cases.

@rhdunn
Contributor Author

rhdunn commented Aug 19, 2023

No, I don't have access to the LDC release.

@arademaker

Just to point out that during our work on https://universalpropositions.github.io/, I also came across many of these sentences, not only in the EWT corpus but also in OntoNotes. The value of keeping these sentences in the treebanks is not clear to me.

@nschneid nschneid added the orthography spelling, punctuation, tokenization label Oct 14, 2023
@nschneid
Contributor

nschneid commented Oct 14, 2023

I see your point that these don't add much, but if it's just a couple dozen sentences their effect will be minuscule. If we're going to go down the road of weeding out sentences that aren't really sentences, EWT has a ton that are just URLs, email signatures, or filenames. For a sample see https://universal.grew.fr/?custom=652b01f3387f2

@arademaker

Still, it would make the data more maintainable.

@rhdunn
Contributor Author

rhdunn commented Oct 14, 2023

Part of the problem with these (and other partial sentences) is that they can cause sentence splitters (those using statistical, neural-network, or dependency-parse-based models) to incorrectly split sentences in some cases. It's on my to-do list to detect incorrectly split sentences in model output, as I've seen this happen quite frequently, for example:

# text = Lawyer
# text = Bell was there and made one 'bout eight months 'fore he died.

My initial thinking for cases such as the URLs, email signatures, and so forth is to ensure that the next sentence is the start of a new paragraph, though I've not yet tested this or written the validation rules/logic.

The URL sentences are valid, as are many of the others. The tricky case is when they combine with other sentences in the generated test data, causing the splitter to make the wrong inferences -- this is where I suspect that having newpar markup on these sentences will help.
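The validation rule described above could be sketched roughly as follows (hypothetical Python; the fragment-detection regex and the `(sent_id, text, starts_newpar)` input shape are assumptions for illustration, not an existing validator):

```python
import re

# A "fragment": a lone URL or a single character.
NON_SENTENCE = re.compile(r"^(?:\S+://\S+|[^\w\s]|\w)$")

def check_newpar_after_fragments(sentences):
    """sentences: list of (sent_id, text, starts_newpar) triples.
    Return the sent_ids of fragments whose following sentence does
    NOT start a new paragraph -- the cases that can mislead splitters."""
    problems = []
    for i, (sent_id, text, _) in enumerate(sentences):
        if NON_SENTENCE.match(text) and i + 1 < len(sentences):
            if not sentences[i + 1][2]:
                problems.append(sent_id)
    return problems
```

A rule like this would pass the `D`/`M`/`*` cases listed earlier (their neighbours carry newpar) while flagging the problematic ones.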
