Several sentences just have a '?' or other single character. #415
Comments
Thanks for pointing these out. My sense is that email/forum data is just going to be messy sometimes, and the notion of a paragraph or sentence unit isn't always clear. I wasn't involved in the original data preprocessing, but if you have access to the LDC release, that might be informative regarding these cases.
No, I don't have access to the LDC release.
Just to point out that during our work on https://universalpropositions.github.io/, I also came across many of those sentences, not only in the EWT corpus but also in OntoNotes. It is not clear to me what the value is of keeping those sentences in the treebanks.
I see your point that these don't add much, but if it's just a couple dozen sentences their effect will be minuscule. If we're going to go down the road of weeding out sentences that aren't really sentences, EWT has a ton that are just URLs, email signatures, or filenames. For a sample see https://universal.grew.fr/?custom=652b01f3387f2
Still, it would make the data more maintainable.
Part of the problem with these (and other partial sentences) is that they can cause sentence splitters (those using statistical, neural network, or dependency parse based models) to split new sentences incorrectly in some cases. I have on my list of things to do detecting incorrectly split sentences in model output, as I've seen this happen quite frequently, for example:
My initial thinking for cases such as the URLs, email signatures, and so forth is to ensure that the next sentence is the start of a new paragraph, though I've not tested this yet, nor written the validation rules/logic. The URL sentence is valid, as are many of the others. The tricky case is when they combine with other sentences in the generated test data, resulting in the splitter making the wrong inferences -- this is where I suspect that having
The following sentences all have text = ?:
email-enronsent09_02-0034
email-enronsent09_02-0036
email-enronsent22_01-0067
email-enronsent26_02-0021
email-enronsent26_02-0023
email-enronsent26_02-0025
email-enronsent26_02-0028 -- This one looks like it may be due to invalid encoding processing of U+00A0 (NBSP) from the preceding sentence.
email-enronsent32_01-p0009
email-enronsent33_01-0130
email-enronsent33_01-0132
email-enronsent33_01-0145
There are also several other sentences where the text is just a single character. Many of these don't have newpar annotations or similar to ensure that they don't combine and interfere with the surrounding sentences:
answers-20111108044917AALAHtc_ans-0005 -- m
answers-20111108084416AAoPgBv_ans-0010 -- %
answers-20111108090913AAf83Jh_ans-p0004 -- 1 (list separator)
answers-20111108090913AAf83Jh_ans-0011 -- 2 (list separator)
answers-20111108090913AAf83Jh_ans-0017 -- 3 (list separator)
answers-20111108090913AAf83Jh_ans-0021 -- 4 (list separator)
email-enronsent27_01-0049 -- m
email-enronsent27_01-0064 -- m
newsgroup-groups.google.com_alt.animals_1054ad831ec01b4c_ENG_20031204_144900-0002 -- s
The following look ok as they have newpar annotations on the sentence and the following sentence:
email-enronsent09_02-p0006 -- D
email-enronsent09_02-p0015 -- D
email-enronsent21_01-p0007 -- M
newsgroup-groups.google.com_HumorUniversity_00dd93cc9545deb3_ENG_20051130_122700-p0001 -- *
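For anyone wanting to reproduce the lists above, here is a minimal sketch of the kind of check I have in mind. It reads only the comment lines of a CoNLL-U file (sent_id, text, newpar) and reports every one-character sentence, together with whether both it and the following sentence start a new paragraph. The names Sentence, read_sentences and isolated_single_chars are made up for illustration, and treating "newpar on this sentence and the next" as the "ok" condition is my assumption, not an agreed validation rule.

```python
import re
from dataclasses import dataclass


@dataclass
class Sentence:
    sent_id: str
    text: str
    newpar: bool  # True if a "# newpar" comment precedes the sentence


def read_sentences(conllu: str) -> list:
    """Collect sent_id, text and newpar status from CoNLL-U comment lines."""
    sentences = []
    # Sentences in CoNLL-U are separated by blank lines.
    for block in re.split(r"\n\s*\n", conllu.strip()):
        if not block.strip():
            continue
        sent_id, text, newpar = "", "", False
        for line in block.splitlines():
            if line.startswith("# sent_id = "):
                sent_id = line[len("# sent_id = "):]
            elif line.startswith("# text = "):
                text = line[len("# text = "):]
            elif line.startswith("# newpar"):
                # Matches both "# newpar" and "# newpar id = ..."
                newpar = True
        sentences.append(Sentence(sent_id, text, newpar))
    return sentences


def isolated_single_chars(sentences):
    """Yield (sentence, ok) for each one-character sentence.

    ok is True when the sentence and the sentence after it (if any)
    both carry a newpar annotation. This is an assumed criterion.
    """
    for i, sent in enumerate(sentences):
        if len(sent.text.strip()) != 1:
            continue
        nxt = sentences[i + 1] if i + 1 < len(sentences) else None
        ok = sent.newpar and (nxt is None or nxt.newpar)
        yield sent, ok
```

Running isolated_single_chars(read_sentences(open(path, encoding="utf-8").read())) over each treebank file would give the candidates to review; the sentences flagged with ok == False are the ones that can merge with their neighbours.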
I don't know the best way to handle/mark up these so that they can be consistently processed (e.g. by tools that convert the CoNLL-U files to text), as there seem to be different cases here.
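To make the conversion problem concrete, here is a rough sketch of a CoNLL-U to text converter of the sort I mean. It is an assumption about how such a tool might work, not a description of any existing one: it takes each sentence's text comment, starts a new paragraph whenever it sees a newpar (or newdoc) comment, and joins sentences within a paragraph with a single space. The function name conllu_to_text is hypothetical.

```python
import re


def conllu_to_text(conllu: str) -> str:
    """Rebuild plain text from the "# text" comments of a CoNLL-U file.

    Paragraph breaks come only from "# newpar" / "# newdoc" comments;
    joining sentences with one space is an assumption of this sketch.
    """
    paragraphs = []  # each paragraph is a list of sentence texts
    for block in re.split(r"\n\s*\n", conllu.strip()):
        text, new_paragraph = None, False
        for line in block.splitlines():
            if line.startswith("# text = "):
                text = line[len("# text = "):]
            elif line.startswith("# newpar") or line.startswith("# newdoc"):
                new_paragraph = True
        if text is None:
            continue  # block without a text comment
        if new_paragraph or not paragraphs:
            paragraphs.append([])
        paragraphs[-1].append(text)
    return "\n\n".join(" ".join(p) for p in paragraphs)
```

With this kind of converter, a sentence whose text is just ? and that has no newpar annotation is appended to the end of the previous paragraph, while one with newpar on it and on the following sentence ends up on its own line. That is the difference between the two groups listed above.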