
clue prompt templates #808

Open: wants to merge 7 commits into main

Conversation

@yongzx (Contributor) commented Jul 27, 2022

Add prompt templates to the CLUE benchmark tasks.

Currently:

  • 3 original task prompt templates
  • 1 non-original task prompt template

WIP:

  • NLI (ocnli, cmnli)
  • IFLYTEK: it has 119 label names, and I haven’t populated the answer-choices yet.
  • ChID: unusual dataset structure where each example contains 4 sub-examples; the model needs to find the correct idiom for each of them. The metric is accuracy, and I am thinking of linearizing the dataset.
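The linearization idea above could look something like the sketch below. The field names and the record shape are illustrative assumptions, not the actual ChID schema:

```python
# Hypothetical ChID-style record: one record bundles several sub-examples,
# each with its own blank, candidate idioms, and gold answer index.
# All field names here are assumptions for illustration.
record = {
    "contents": ["sentence with blank #1", "sentence with blank #2"],
    "candidates": [["idiom_a", "idiom_b"], ["idiom_c", "idiom_d"]],
    "answers": [0, 1],
}

def linearize(record):
    """Flatten one multi-part record into independent (text, choices, label) examples,
    so each sub-example can be prompted and scored (accuracy) on its own."""
    return [
        {"text": text, "choices": choices, "label": label}
        for text, choices, label in zip(
            record["contents"], record["candidates"], record["answers"]
        )
    ]

flat = linearize(record)
```

Each flattened example can then be templated like any other single-answer multiple-choice task.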

Comment on lines 23 to 25
jinja: "Question: \"{{question}}\"\nAnswer choices: {{ answer_choices[:-1] | join(',\
\ ') }}, or {{ answer_choices[-1] }}?\nPassage: {% for statement in context\
\ %} \n{{ statement }}\n{% endfor %}\n|||\n{{ answer }}"
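For reference, a minimal sketch of how a template like this renders, using jinja2 directly on a toy example (promptsource's own machinery does more, e.g. answer-choice parsing; here `|||` just separates input from target):

```python
from jinja2 import Template

# Toy example adapted from the rendered sample in this thread; the field
# names (question, answer_choices, context, answer) follow the template above.
example = {
    "question": "根据对话,可以知道什么?",
    "answer_choices": ["今天天气不好", "比赛时间变了", "校长忘了时间"],
    "context": [
        "男:足球比赛是明天上午八点开始吧?",
        "女:因为天气不好,比赛改到后天下午三点了。",
    ],
    "answer": "比赛时间变了",
}

template = Template(
    'Question: "{{question}}"\n'
    "Answer choices: {{ answer_choices[:-1] | join(', ') }}, or {{ answer_choices[-1] }}?\n"
    "Passage: {% for statement in context %} \n{{ statement }}\n{% endfor %}\n"
    "|||\n{{ answer }}"
)

rendered = template.render(**example)
prompt, target = (part.strip() for part in rendered.split("|||"))
```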


This renders as e.g. ['Given the dialogue / passage below, what is the answer for the question "根据对话,可以知道什么?"\nAnswer choices: 今天天气不好, 比赛时间变了, or 校长忘了时间?\n \n男:足球比赛是明天上午八点开始吧?\n \n女:因为天气不好,比赛改到后天下午三点了。', '比赛时间变了']

How should the model know where the passage actually ends?
It may be reasonable for it to just continue the previous passage. I think fine-tuning on such examples may degrade generation quality. cc @thomasw21

Member


I don't understand the point you're making. I don't know why the rendering is a list ...


My bad, the first one is the input, the second the answer. So if we separate them with a whitespace, the model will get:

Given the dialogue / passage below, what is the answer for the question "根据对话,可以知道什么?"\nAnswer choices: 今天天气不好, 比赛时间变了, or 校长忘了时间?\n \n男:足球比赛是明天上午八点开始吧?\n \n女:因为天气不好,比赛改到后天下午三点了。 比赛时间变了

Member


Ah I understand, yeah good point, I think putting the passage above makes more sense

{passage} 

Given the dialogue / passage below, what is the answer for the question "根据对话,可以知道什么?"\nAnswer choices: 今天天气不好, 比赛时间变了, or 校长忘了时间?

One other way of doing it is have a between input and target, which we've been avoiding but in this case it might make sense?

Nit: you could also remove the answer choices and have the model figure them out (a much harder task, but it might help training quite a bit?)


By "have a between", do you mean have an EOS token?

Agreed, let's add another prompt without answer choices if you agree @yongzx?

Member


Yes woops

@yongzx (Contributor, Author) commented Jul 30, 2022


Yep, I can do that (moving the passage before the task description and adding prompts without answer choices).

Contributor Author


@Muennighoff @thomasw21 For prompts without answer choices, should I mark them as non-original? I think we should use ROUGE or other generation metrics instead of the original metric (accuracy), since we are no longer choosing an answer from the given options.

Member


I don't know; I'm not too familiar with how that terminology is used. Perhaps @VictorSanh can help with this kind of thing.

subset: csl
templates:
219679f8-a02f-4ee3-91c7-9ed4726dd828: !Template
answer_choices: no ||| yes


Suggested change
answer_choices: no ||| yes
answer_choices: yes ||| no

Currently I get the below:
['Do these keywords "纳米粒子, 干细胞, 氧化物, 顺磁性" represent key concepts in the abstract "目的探讨常见氧化铁纳米粒子几种神经干细胞标记技术的标记效率.材料与方法使用超顺磁性氧化铁纳米粒子(SPIO)和超微超顺磁性氧化铁纳米粒子(USPIO)以25μgFe/ml分别单独标记、与多聚赖氨酸(PLL)及脂质体联合标记神经干细胞,以未标记细胞做对照,采用普鲁士蓝染色评价细胞标记率,并采用4.7TMRIT2WI多回波序列测量T2弛豫率(R2)评价细胞内的铁摄取量,比较各组R2的差异.结果①普鲁士蓝染色结果:SPIO及USPIO单独标记组标记率为60%~70%,低于联合标记组的100%;②MRI结果:未标记细胞R2为(2.10±0.11)/s,SPIO、USPIO单独标记组细胞R2分别为(3.39±0.21)/s、(3.16±0.32)/s,SPIO-脂质体联合标记组及USPIO-脂质体联合标记组R2分别为(4.03±025)/s、(3.61±0.32)/s,SPIO-PLL联合标记组及USPIO-PLL联合标记组R2分别为(5.38±0.52)/s、(4.44±0.35)/s,SPIO、USPIO与PLL联合标记组R2大于SPIO、USPIO与脂质体联合标记组(P<0.05);而与脂质体联合标记组R2大于单独标记组(P<0.05);SPIO与USPIO单独标记细胞时R2差异无统计学意义(P>0.05),SPIO与脂质体或PLL联合标记时R2高于USPIO(P<0.05).结论SPIO、USPIO单独标记及与PLL、脂质体联合标记均可以成功标记神经干细胞,提高R2,其中SPIO与PLL联合标记效率最高."?', 'no']

I think it should be yes.

Contributor Author


I follow the labeling described in the CLUE paper: https://arxiv.org/pdf/2004.05986.pdf (Table 5), and label 0 corresponds to false.
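The disagreement here comes down to which index in `answer_choices` the integer label selects. Under the CLUE convention cited above (label 0 corresponds to false), the `no ||| yes` ordering is the consistent one. A tiny sketch of the mapping, mimicking how a promptsource template's `{{ answer_choices[label] }}` resolves:

```python
# answer_choices in promptsource templates is a "|||"-separated string;
# the template indexes into it with the dataset's integer label.
answer_choices = "no ||| yes".split(" ||| ")

def verbalize(label: int) -> str:
    """Map the dataset's integer label to its verbalized answer choice."""
    return answer_choices[label]

# Per the CLUE paper (Table 5), label 0 = false, so index 0 must be "no".
first = verbalize(0)
second = verbalize(1)
```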


Hmm, so CSL appears to be very noisy, see this issue.
The example from the paper also appears with both labels.
Let's leave it as is, but I'm not sure we should use it for fine-tuning.


name: write_keywords_after_abstract
reference: ''
2e851dd2-2677-415a-ad90-5d885aa91fdc: !Template
answer_choices: no ||| yes


Suggested change
answer_choices: no ||| yes
answer_choices: yes ||| no

Contributor Author


Disagree, for the reason given above (in the CLUE paper, label 0 corresponds to false).

name: generate_keywords
reference: ''
aaf47f6f-fd8f-4180-8d85-e4c7df088ac6: !Template
answer_choices: no ||| yes


Suggested change
answer_choices: no ||| yes
answer_choices: yes ||| no

Contributor Author


Disagree, for the reason given above (in the CLUE paper, label 0 corresponds to false).

Comment on lines +7 to +11
jinja: 'Do "{{ sentence1 }}" and "{{ sentence2 }}" express the same thing?

|||

{{ answer_choices[label] }}'
Member


Stupid question: how exactly does this generate samples, in particular with the \n and whitespace at the beginning and the end? Does it always get trimmed?


Yeah, the \n and whitespace before and after ||| get trimmed away.
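That trimming behavior can be sketched as follows. This is a simplified stand-in for what the rendering machinery does when splitting on `|||`; the actual implementation may differ in details:

```python
# A rendered template string as produced by the jinja above: the input and
# target are separated by "|||", with stray newlines around the separator.
rendered = 'Do "A" and "B" express the same thing?\n\n|||\n\nyes'

# Split on the separator and strip surrounding whitespace/newlines from
# each side, which is why the blank lines in the template are harmless.
input_text, target = (part.strip() for part in rendered.split("|||"))
```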
