Several sentences just have a '?' or other single character. #415

rhdunn · 2023-08-19T08:15:47Z

The following sentences all have text = ?:

email-enronsent09_02-0034
email-enronsent09_02-0036
email-enronsent22_01-0067
email-enronsent26_02-0021
email-enronsent26_02-0023
email-enronsent26_02-0025
email-enronsent26_02-0028 -- This one looks like it may be due to invalid encoding handling of U+00A0 (NBSP) from the preceding sentence.
email-enronsent32_01-p0009
email-enronsent33_01-0130
email-enronsent33_01-0132
email-enronsent33_01-0145

There are also several other sentences where the text is just a single character. Many of these don't have newpar annotations or similar to ensure that they don't combine and interfere with the surrounding sentences:

answers-20111108044917AALAHtc_ans-0005 -- m
answers-20111108084416AAoPgBv_ans-0010 -- %
answers-20111108090913AAf83Jh_ans-p0004 -- 1 (list separator)
answers-20111108090913AAf83Jh_ans-0011 -- 2 (list separator)
answers-20111108090913AAf83Jh_ans-0017 -- 3 (list separator)
answers-20111108090913AAf83Jh_ans-0021 -- 4 (list separator)
email-enronsent27_01-0049 -- m
email-enronsent27_01-0064 -- m
newsgroup-groups.google.com_alt.animals_1054ad831ec01b4c_ENG_20031204_144900-0002 -- s

The following look ok as they have newpar annotations on the sentence and the following sentence:

email-enronsent09_02-p0006 -- D
email-enronsent09_02-p0015 -- D
email-enronsent21_01-p0007 -- M
newsgroup-groups.google.com_HumorUniversity_00dd93cc9545deb3_ENG_20051130_122700-p0001 -- *

I don't know what the best way is to handle/mark up these so they can be consistently processed (e.g. by tools that convert the CoNLL-U files to text), as there seem to be different cases here.
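To illustrate why the missing newpar annotations matter for text conversion, here is a minimal Python sketch of a CoNLL-U-to-text converter. It is not taken from any existing tool: the function name `conllu_to_text` is hypothetical, and it assumes sentences are separated by blank lines, that a `# newpar` or `# newdoc` comment starts a new paragraph, and that sentences within a paragraph are joined by a single space (SpaceAfter information is ignored).

```python
def conllu_to_text(conllu: str) -> str:
    """Rebuild plain text from "# text" comments, one output line per paragraph."""
    prefix = "# text = "
    paragraphs: list[list[str]] = []
    for block in conllu.strip().split("\n\n"):
        lines = block.splitlines()
        text = next((ln[len(prefix):] for ln in lines if ln.startswith(prefix)), None)
        if text is None:
            continue  # not a sentence block
        # "# newpar" / "# newdoc" (with or without an id) opens a new paragraph.
        starts_par = any(ln.startswith(("# newpar", "# newdoc")) for ln in lines)
        if starts_par or not paragraphs:
            paragraphs.append([])
        paragraphs[-1].append(text)
    return "\n".join(" ".join(p) for p in paragraphs)
```

Under these assumptions, a single-character sentence such as `m` with no newpar on it or on the following sentence ends up on the same line as its neighbours, while one that has newpar on both sides is rendered on a line of its own.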

nschneid · 2023-08-19T14:55:02Z

Thanks for pointing these out—my sense is that email/forum data is just going to be messy sometimes, and the notion of a paragraph or sentence unit isn't always clear. I wasn't involved in the original data preprocessing, but if you have access to the LDC release that might be informative regarding these cases.

rhdunn · 2023-08-19T18:03:14Z

No, I don't have access to the LDC release.

arademaker · 2023-08-19T18:35:57Z

Just to point out that during our work on https://universalpropositions.github.io/, I also came across many such sentences, not only in the EWT corpus but also in OntoNotes. The value of keeping those sentences in the treebanks is not clear to me.

nschneid · 2023-10-14T21:05:26Z

I see your point that these don't add much, but if it's just a couple dozen sentences their effect will be minuscule. If we're going to go down the road of weeding out sentences that aren't really sentences, EWT has a ton that are just URLs, email signatures, or filenames. For a sample see https://universal.grew.fr/?custom=652b01f3387f2

arademaker · 2023-10-14T21:22:08Z

Still, it would make the data more maintainable.

rhdunn · 2023-10-14T21:57:20Z

Part of the problem with these (and other partial sentences) is that they can cause sentence splitters (those using statistical, neural network, or dependency-parse-based models) to incorrectly split new sentences in some cases. I have on my list of things to do to detect incorrectly split sentences in model output, as I've seen this happen quite frequently, for example:

# text = Lawyer
# text = Bell was there and made one 'bout eight months 'fore he died.

My initial thinking for cases such as the URLs, email signatures, and so forth is to ensure that the next sentence is the start of a new paragraph, though I've not tested this yet, nor written the validation rules/logic.

The URL sentence is valid, as are many of the others. The tricky case is when they combine with other sentences in the generated test data, resulting in the splitter making the wrong inferences -- this is where I suspect that having newpar markup on these will help.
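As a rough idea of what such a validation rule could look like, here is an untested-against-EWT Python sketch. Everything in it is an assumption for illustration: the names `looks_like_fragment` and `missing_newpar_after` are hypothetical, the fragment test (a single character or a bare URL) is only a stand-in for whatever criteria are eventually chosen, and `# newdoc` is treated as implying a paragraph start.

```python
import re
from typing import Callable


def looks_like_fragment(text: str) -> bool:
    """Hypothetical fragment test: a single character, or a bare http(s) URL."""
    text = text.strip()
    return len(text) == 1 or re.fullmatch(r"https?://\S+", text) is not None


def missing_newpar_after(
    conllu: str,
    is_fragment: Callable[[str], bool] = looks_like_fragment,
) -> list[tuple[str, str]]:
    """Return (fragment_id, next_id) pairs where the sentence after a fragment lacks newpar."""
    sents = []  # (sent_id, text, starts_paragraph)
    for block in conllu.strip().split("\n\n"):
        meta: dict[str, str] = {}
        starts_par = False
        for line in block.splitlines():
            if line.startswith(("# newpar", "# newdoc")):
                starts_par = True
            elif line.startswith("# ") and " = " in line:
                key, value = line[2:].split(" = ", 1)
                meta[key] = value
        if "sent_id" in meta:
            sents.append((meta["sent_id"], meta.get("text", ""), starts_par))
    # Compare each sentence with the one that follows it in the file.
    return [
        (cur[0], nxt[0])
        for cur, nxt in zip(sents, sents[1:])
        if is_fragment(cur[1]) and not nxt[2]
    ]
```

Passing a different `is_fragment` callable would let the same check cover email signatures or filenames without changing the traversal logic.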

nschneid added the orthography (spelling, punctuation, tokenization) label Oct 14, 2023

nschneid mentioned this issue Oct 28, 2023: Inconsistent annotations for LS numbers #464 (Closed)

nschneid mentioned this issue Nov 4, 2023: Tokenization of space-separated ellipsis. UniversalDependencies/docs#988 (Open)
