Add detailed calibration metrics #1247

Merged
merged 4 commits into main from dtsip-calibration on Dec 16, 2022

Conversation

@dtsip (Collaborator) commented Dec 14, 2022

Added a "targeted evaluation" for calibration where we display additional metrics for some scenarios.

In terms of infrastructure:

  • Added logic to handle the _detailed metric groups (e.g., calibration_detailed).
  • Allowed RunGroups to specify metric_groups that override those of subgroups. This allows us to use the detailed version of a metric group without needing to specify scenario groups again.
  • Orthogonal: Added a subgroup_metric_groups_hidden list that allows us to hide some metric groups (e.g., since we run some ablations without perturbations, we hide Robustness there). See the sketch after this list for how these fields fit together.
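
A simplified sketch of the fields involved (the names match the PR, but this stripped-down class is only a stand-in for illustration, not the actual HELM RunGroup definition):

from dataclasses import dataclass, field
from typing import List

@dataclass
class RunGroup:
    name: str
    # Metric groups to display for this group; a "*_detailed" entry
    # (e.g., "calibration_detailed") is an expanded version of the basic group.
    metric_groups: List[str] = field(default_factory=list)
    # Subgroups whose metric groups are normally aggregated here; if this group
    # sets metric_groups, that list overrides the subgroups' choices.
    subgroups: List[str] = field(default_factory=list)
    # Metric groups coming from subgroups that should be hidden for this group
    # (e.g., hide "robustness" for ablations run without perturbations).
    subgroup_metric_groups_hidden: List[str] = field(default_factory=list)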

In terms of annotation in schema.yaml, I was not sure about the following (cc @AnanyaKumar @rishibommasani)

  • Which calibration metrics we want, see calibration_detailed
  • Which scenarios we want to look at, see the "Calibration" targeted evaluation

Resolves #1125.

# deduplicate, remove basic metric group if we include the detailed one, remove hidden metric groups
all_metric_groups = [
    metric_group
    for metric_group in dict.fromkeys(all_metric_groups)

Collaborator:

Just do set(all_metric_groups) instead of dict.fromkeys(all_metric_groups)

Collaborator (Author):

Oh, I'm using dict to preserve the order of the metric_groups. (For set the order is undefined)
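
For reference, a tiny standalone illustration of the difference (not part of the diff):

groups = ["accuracy", "calibration", "accuracy", "robustness"]

list(dict.fromkeys(groups))  # ['accuracy', 'calibration', 'robustness'] -- first-seen order kept
list(set(groups))            # duplicates removed too, but iteration order is arbitrary,
                             # so the rendered metric groups could come out shuffled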

    metric_group
    for metric_group in dict.fromkeys(all_metric_groups)
    if f"{metric_group}_detailed" not in all_metric_groups
    and metric_group not in group.subgroup_metric_groups_hidden

Collaborator:

Note that this has the (perhaps intended?) effect that if subgroup_metric_groups_hidden contains only robustness and the metric_groups contains robustness_detailed, then robustness_detailed will still be included.

Collaborator (Author):

Oh that's a good point that was definitely not intended!

But thinking a bit more about it, I think it actually makes sense to keep as-is?
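
A small self-contained sketch of the behavior discussed in this thread (the values are made up; the comprehension mirrors the one in the diff):

all_metric_groups = ["robustness", "robustness_detailed", "calibration"]
subgroup_metric_groups_hidden = ["robustness"]

filtered = [
    metric_group
    for metric_group in dict.fromkeys(all_metric_groups)
    if f"{metric_group}_detailed" not in all_metric_groups
    and metric_group not in subgroup_metric_groups_hidden
]
print(filtered)  # ['robustness_detailed', 'calibration']
# "robustness" is dropped (its detailed variant is present and it is hidden),
# but "robustness_detailed" itself survives -- the effect noted above.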

dtsip commented Dec 16, 2022

@AnanyaKumar @rishibommasani : for detailed calibration, what do we want to look at in terms of
a. calibration metrics
b. scenarios

dtsip commented Dec 16, 2022

Added all the calibration metrics and used IMDB/MMLU/RAFT/CivilComments as scenarios for now. Feel free to change.

dtsip merged commit 21ae634 into main on Dec 16, 2022
dtsip deleted the dtsip-calibration branch on December 16, 2022 04:54
@AnanyaKumar

Hey, I'm sorry for the late reply! Thank you for getting this done Dimitris!

Metrics: I think adding all the calibration metrics in detailed results sounds good.

Scenarios: I think pretty much any scenario should support calibration now! We support calibration for generation tasks as well.

@rishibommasani in the HELM paper we don't evaluate calibration on XSUM. Is xsum a generation problem with support for exact_match? If so, we can add calibration for xsum as well.
