WeSearch_StarSem
Some notes on the StarSEM 2012 shared task. I've used annotation conventions similar to our previous work, with <> for cues, {} for scope and now [] for events. For papers, though, we should probably follow Morante et al.'s (2011) conventions of bold for cues, underline for scope and italic for events.
- Entire sentence, except initial/final punctuation: P=64.20, R=17.59, F1=27.61
- From cue to left and right punctuation or sentence boundary: P=97.95, R=32.36, F1=48.65
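As a quick sanity check on the two baselines above, the F1 values can be recomputed from the reported precision and recall. This is only a sketch of the standard harmonic-mean formula; the function name `f1` and the dictionary labels are made up for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the two baseline lines above.
baselines = {
    "entire sentence": (64.20, 17.59),
    "cue to punctuation": (97.95, 32.36),
}

for name, (p, r) in baselines.items():
    print(f"{name}: F1={f1(p, r):.2f}")
```

Both recomputed values agree with the reported F1 figures to two decimal places.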
The files are provided in CoNLL format, with the first 7 columns corresponding to:
- Book_Chapter
- Sentence number within chapter
- token number within sentence
- word
- lemma
- part-of-speech
- syntax
If the sentence does not have negations:
- (8) the literal string ***
Otherwise there are three columns per negation:
- (8,11,14, ...) word (or part of word) that is part of the cue
- (9,12,15, ...) word (or part of word) that is part of the scope
- (10,13,16, ...) word (or part of word) that is part of the event
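The column layout above could be read with something like the following sketch. It assumes tab-separated columns and that cells with no cue/scope/event material hold a placeholder such as `_`; neither detail is stated in these notes, and the names `Token`, `Negation` and `parse_token` as well as the sample lines are invented for illustration.

```python
from typing import List, NamedTuple


class Negation(NamedTuple):
    cue: str    # columns 8, 11, 14, ...
    scope: str  # columns 9, 12, 15, ...
    event: str  # columns 10, 13, 16, ...


class Token(NamedTuple):
    chapter: str
    sentence: int
    index: int
    word: str
    lemma: str
    pos: str
    syntax: str
    negations: List[Negation]  # one entry per negation in the sentence


def parse_token(line: str) -> Token:
    """Parse one token line (assumed to be tab-separated)."""
    cols = line.rstrip("\n").split("\t")
    fixed, rest = cols[:7], cols[7:]
    if rest == ["***"]:
        negations = []  # sentence without negation
    else:
        if len(rest) % 3 != 0:
            raise ValueError("expected three columns per negation")
        negations = [Negation(*rest[i:i + 3]) for i in range(0, len(rest), 3)]
    return Token(fixed[0], int(fixed[1]), int(fixed[2]), *fixed[3:7], negations)


# Invented sample lines, not taken from the corpus.
plain = parse_token("ch1\t1\t0\tHello\thello\tUH\t(S*\t***")
negated = parse_token("ch1\t0\t3\tnever\tnever\tRB\t(ADVP*)\tnever\t_\t_")
print(plain.negations, negated.negations)
```

Keeping one `Negation` triple per instance means the i-th entry lines up across all tokens of a sentence, which makes it easy to collect the cue, scope and event of a single negation.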
3,644 sentences with 986 instances of negation.
98 instances have no scope; 93 instances have a discontinuous scope that is not bridged by the cue.
Of the remaining 795 instances, 439 are aligned with some constituent in the C&J parses. Applying our bioscope slackening heuristics:
- constituent final punctuation (+93)
- constituent initial punctuation (+37)
- initial adverbs when not the cue (+9)
- scope starts with cue when it is a noun (-1)
- scope does not start with an auxiliary (-6)
Applying only the beneficial heuristics leaves us with an alignment rate of 72.7%.
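The 72.7% figure can be reproduced from the counts above, under the assumption that the per-heuristic gains are additive (that is, each heuristic rescues a disjoint set of instances). The notes do not state this explicitly, so treat the sketch below as a plausibility check rather than the actual evaluation procedure; the variable names are invented.

```python
total_instances = 986
no_scope = 98
discontinuous = 93
remaining = total_instances - no_scope - discontinuous  # 795

aligned_before = 439

# Change in aligned instances per heuristic, as listed above.
heuristic_gains = {
    "constituent final punctuation": 93,
    "constituent initial punctuation": 37,
    "initial adverbs when not the cue": 9,
    "scope starts with cue when it is a noun": -1,
    "scope does not start with an auxiliary": -6,
}

# Keep only heuristics with a positive effect (assumed additive).
beneficial = sum(gain for gain in heuristic_gains.values() if gain > 0)
rate = (aligned_before + beneficial) / remaining
print(f"{aligned_before + beneficial} of {remaining} aligned: {rate:.1%}")
```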
The following are listings of instances with no scope, discontinuous scope, scope that is not aligned with a constituent, and scope that is aligned with a constituent. Within each listing, instances are delimited by double newlines. The first line of an instance is an identifier composed of: chapter <TAB> sentence in chapter <TAB> negation in sentence. The second line holds the tokens of the sentence, where cues are indicated with < >, scope with { }, and the most specific subsuming constituent of the scope with _ _.
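A listing in this layout could be split into instances roughly as follows. This is a minimal sketch: it assumes the sentence and negation fields of the identifier are integers, and the `Instance` type, the `parse_listing` function and the two sample instances are made up for illustration.

```python
from typing import List, NamedTuple


class Instance(NamedTuple):
    chapter: str
    sentence: int  # sentence in chapter (assumed to be an integer)
    negation: int  # negation in sentence (assumed to be an integer)
    text: str      # tokens with < >, { } and _ _ markers left in place


def parse_listing(listing: str) -> List[Instance]:
    """Split a listing into instances delimited by double newlines."""
    instances = []
    for block in listing.strip().split("\n\n"):
        if not block.strip():
            continue  # tolerate an empty listing
        header, text = block.split("\n", 1)
        chapter, sentence, negation = header.split("\t")
        instances.append(Instance(chapter, int(sentence), int(negation), text.strip()))
    return instances


# Invented sample listing with two instances.
sample = (
    "chapterA\t3\t0\n_{I do} <not> {know}_ .\n\n"
    "chapterA\t7\t1\n{He} _<never> {came back}_ ."
)
for inst in parse_listing(sample):
    print(inst.chapter, inst.sentence, inst.negation, inst.text)
```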
Additional slackening heuristics for CD:
- if a non-NP node on path from cue to subsumer has a sibling CC, scope starts before CC
- move in from initial CC, UH, ADVP or INTJ
Current alignment rate is 81.6% of continuous scopes.
TRAINING | | | DEVELOPMENT | | |
Freq. | Cue | PoS | Freq. | Cue | PoS |
326 | not | RB | 39 | not | RB |
139 | no | DT | 27 | no | DT |
72 | un* | JJ | 20 | n't | RB |
65 | n't | RB | 16 | nothing | NN |
59 | never | RB | 12 | un* | JJ |
59 | no | UH | 11 | never | RB |
55 | nothing | NN | 7 | without | IN |
27 | no | RB | 5 | im* | JJ |
24 | without | IN | 4 | nor | CC |
22 | *not | RB | 4 | in* | JJ |
20 | *less | JJ | 4 | no | UH |
17 | in* | JJ | 3 | un* | RB |
16 | im* | JJ | 3 | *less | JJ |
12 | none | NN | 2 | neither_*_nor | DT_*_CC |
6 | nor | CC | 1 | nobody | NN |
4 | in* | RB | 1 | in* | RB |
4 | un* | RB | 1 | dis* | VBN |
4 | *less* | RB | 1 | dis* | NN |
4 | ir* | JJ | 1 | save | IN |
3 | *less | NN | 1 | ir* | JJ |
3 | dis* | NN | 1 | *not | NN |
2 | im* | RB | 1 | no_*_nor | DT_*_CC |
2 | nowhere | RB | 1 | *n* | RB |
2 | neither_*_nor | DT_*_CC | 1 | more | JJR |
2 | *not | NN | 1 | im* | NN |
2 | *not | VBD | 1 | neither | DT |
2 | prevent | VB | 1 | no_more | DT_RBR |
2 | *not | VBP | 1 | by_no_means | IN_DT_NNS |
2 | on_the_contrary | IN_DT_NN | 1 | un* | VBN |
2 | by_no_means | IN_DT_NNS | | | |
1 | rather_than | RB_IN | | | |
1 | by_no_means | IN_DT_VBZ | | | |
1 | nobody | NN | | | |
1 | ir* | RB | | | |
1 | fail | VBP | | | |
1 | no* | NN | | | |
1 | un | IN | | | |
1 | absence | NN | | | |
1 | nothing_at_all | NN_IN_DT | | | |
1 | neglected | VBN | | | |
1 | dis* | VBN | | | |
1 | refused | VBD | | | |
1 | no | NNP | | | |
1 | in* | NNS | | | |
1 | un* | IN | | | |
1 | ir* | NN | | | |
1 | not_the | RB_DT | | | |
1 | not_for_the_world | RB_IN_DT_NN | | | |
1 | save | VB | | | |
1 | except | VB | | | |
1 | *less* | JJ | | | |
1 | unusual | JJ | | | |
1 | *less* | NN | | | |
1 | un* | NN | | | |
1 | dis* | JJ | | | |
1 | not_*_not | RB_*_RB | | | |
1 | un* | VBN | | | |
3,640 sentences with 989 instances of negation.
99 instances have no scope; 92 instances have a discontinuous scope that is not bridged by the cue.
Of the remaining 798 instances, 80 are aligned with some constituent. Applying our bioscope slackening heuristics:
- constituent final punctuation (+437)
- constituent initial punctuation (+70)
- initial adverbs when not the cue (+8)
- scope starts with cue when it is a noun (+2)
- scope does not start with an auxiliary (-9)
Applying only the beneficial heuristics leaves us with an alignment rate of 73.4%.
...
371 instances have no event; 14 instances have discontinuous events. In 6 instances the event lies outside of the scope; these seem to be annotation errors:
-
... only {an} <un>[ambitious] {one who abandons a London career for the country} ...
-
... {an} <un>[justifiable] {intrusion}, ...
-
{It} <never> [recovered] {from the blow}, ...
-
"But {I} [can]<'t> {forget them}, Miss Stapleton," said I.
-
... and means to [spare] <no> {pains or expense} to restore the grandeur of his family.
-
Coming down with an <un>[signed] {warrant}.
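Cases like the six above can be spotted mechanically in the bracketed notation by checking whether each [ ] event opens inside a { } scope span. The sketch below works on the annotated strings used in these notes, not on the CoNLL files, and the function name `events_outside_scope` is invented.

```python
from typing import List


def events_outside_scope(annotated: str) -> List[str]:
    """Return the [ ] event spans that do not start inside any { } scope span."""
    outside = []
    depth = 0                 # nesting depth of { } scope markers
    start = None              # index where the current event text begins
    started_in_scope = False  # was the event opened inside a scope span?
    for i, ch in enumerate(annotated):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)
        elif ch == "[":
            start = i + 1
            started_in_scope = depth > 0
        elif ch == "]" and start is not None:
            if not started_in_scope:
                outside.append(annotated[start:i])
            start = None
    return outside


print(events_outside_scope("{It} <never> [recovered] {from the blow}, ..."))
print(events_outside_scope("... and <not> {[broad] enough for a mastiff}."))
```

The first call reports `recovered` because the event sits between two scope spans, while the second reports nothing because the event is enclosed by the scope braces.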
...
Collins' coverage of the training data is 99.4% (21 of 3,640 sentences are not covered). In those 21 there are 10 instances of negation, for example:
- "Know then that in the time of the Great Rebellion (the history of which by the learned Lord Clarendon I most earnestly commend to your attention) this Manor of Baskerville has held by Hugo of that name, nor <can> {[it] be gainsaid that he was a most wild, profane, and {[god]}<less> man}.
Training | | Development | |
Freq. | Word | Freq. | Word |
35 | don't | 17 | 't |
11 | can't | 3 | don't |
7 | n't | | |
6 | isn't | | |
5 | didn't | | |
2 | couldn't | | |
Of the training data bigrams ending in n't there are:
- 4 do n't
- 1 did n't
- 1 had n't
- 1 wo n't
Of the development data bigrams ending in 't there are:
- 7 don 't
- 4 can 't
- 3 didn 't
- 1 couldn 't
- 1 shan 't
- 1 wasn 't
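Bigram tallies like the two lists above could be produced with a small counter over the token sequence. This is only a sketch of the counting step: the function name `bigrams_ending_in` and the sample tokens are invented, and reading the word column from the data files is left out.

```python
from collections import Counter
from typing import List


def bigrams_ending_in(tokens: List[str], last: str) -> Counter:
    """Count the first tokens of all bigrams whose second token equals `last`."""
    return Counter(prev for prev, cur in zip(tokens, tokens[1:]) if cur == last)


# Invented token sequences mimicking the two tokenization styles.
training_style = ["I", "do", "n't", "know", "and", "I", "did", "n't", "ask"]
development_style = ["I", "don", "'t", "know", "and", "I", "can", "'t", "say"]

print(bigrams_ending_in(training_style, "n't"))
print(bigrams_ending_in(development_style, "'t"))
```

Running the same function with `n't` on one data set and `'t` on the other makes the tokenization mismatch between training and development data directly visible.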
There is a full listing of tokens containing punctuation here: JimWhite/StarSemTokenTabulation.
HoundOfTheBaskervilles_ch1, s1: prefixed cue, weirdness
-
Mr. Sherlock Holmes, who was usually very late in the mornings, save upon {those} not <in>{frequent occasions when he was up all night}, was seated at the breakfast table.
-
Mr. Sherlock Holmes, who was usually very late in the mornings, save upon {those} <not> {infrequent occasions when he was up all night}, was seated at the breakfast table.
-
Mr. Sherlock Holmes, {who was} usually {very late in the mornings,} <save> {upon those not infrequent occasions when he was up all night}, was seated at the breakfast table.
HoundOfTheBaskervilles_ch1, s12: prefixed cue
- Since {we have been so} <un>{[fortunate]} {as to miss him} and have no notion of his errand, this accidental souvenir becomes of importance.
HoundOfTheBaskervilles_ch1, s67: discontinuous scope
- If {he was} in the hospital and yet <not> {on the staff} he could only have been a house-surgeon or a house-physician: little more than a senior student.
HoundOfTheBaskervilles_ch1, s8: weirdness
- It is my experience that it is only an amiable man in this world who receives testimonials, only {an} <un>[ambitious] {one who abandons a London career for the country}, and only an absent-minded one who leaves his stick and not his visiting-card after waiting an hour in your room.
HoundOfTheBaskervilles_ch1, s89: discontinuous scope
- {The dog's jaw}, as shown in the space between these marks, {is} too broad in my opinion for a terrier and <not> {[broad] enough for a mastiff}.
HoundOfTheBaskervilles_ch3, s235: Multi-word cue, discontinuous scope
- Then, again, whom was he waiting for that night, and why was {he [waiting] for him} in the yew alley <rather than> {in his own house}?"
HoundOfTheBaskervilles_ch4, s154: contracted cue
- But as to my uncle's death: well, it all seems boiling up in my head, and {I [can]}<'t> {get it clear yet}.
HoundOfTheBaskervilles_ch4, s233: Abbreviation of "number" tagged as negation
- {No.} 2704 is our man .
Freq. | Cue | PoS |
346 | not | RB |
137 | no | DT |
71 | un | JJ |
64 | no | UH |
58 | never | RB |
55 | nothing | NN |
36 | n't | RB |
24 | without | IN |
22 | less | JJ |
18 | no | RB |
17 | in | JJ |
16 | im | JJ |
12 | none | NN |
8 | n't | JJ |
6 | 't | RB |
6 | n't | VB |
5 | n't | NN |
5 | no | NNP |
5 | ir | JJ |
4 | nor | CC |
4 | un | RB |
4 | less | RB |
4 | in | RB |
3 | dis | NN |
3 | not | VB |
3 | less | NN |
2 | '<NULL>' | '<NULL>' |
2 | not | JJ |
2 | un | NN |
2 | not | NN |
2 | un | IN |
2 | nowhere | RB |
2 | by_no_means | IN_DT_NN |
2 | prevent | VB |
2 | n't | NNP |
2 | 't | NN |
2 | im | RB |
2 | on_the_contrary | IN_DT_NN |
1 | rather_than | RB_IN |
1 | nobody | NN |
1 | been | VBN |
1 | fail | VBP |
1 | neither_*_nor | CC_*_CC |
1 | absence | NN |
1 | other | JJ |
1 | nothing_at_all | NN_IN_DT |
1 | can | MD |
1 | neglected | VBN |
1 | ir | RB |
1 | un | VBG |
1 | refused | VBD |
1 | the | DT |
1 | yet | RB |
1 | never | NNP |
1 | save | VBP |
1 | not_for_the_world | RB_IN_DT_NN |
1 | un | VBN |
1 | signs | NNS |
1 | in | NNS |
1 | no | JJ |
1 | unusual | JJ |
1 | dis | VBN |
1 | neither_*_nor | DT_*_CC |
1 | by_no_means | IN_RB_VBZ |
1 | not_*_not | RB_*_RB |
1 | except | IN |
1 | dis | JJ |
The full list is here. There are 367 token/pos types.
Freq. | Word | PoS |
51 | could | MD |
25 | can | RB |
19 | have | VBP |
14 | had | VBD |
12 | know | VB |
10 | know | VBP |
7 | able | JJ |
7 | seen | VBN |
6 | happy | JJ |
5 | pleasant | JJ |
5 | like | IN |
5 | sign | NN |
5 | say | VB |
5 | man | NN |
4 | likely | JJ |
4 | heard | VBN |
4 | saw | VBD |
4 | can | MD |
4 | possible | JJ |
4 | known | JJ |