Add detailed calibration metrics #1247

Merged
merged 4 commits into main from dtsip-calibration on Dec 16, 2022

Conversation

@dtsip (Collaborator) commented Dec 14, 2022

Added a "targeted evaluation" for calibration where we display additional metrics for some scenarios.

In terms of infrastructure:

  • Added logic to handle the _detailed metric groups (e.g., calibration_detailed).
  • Allowed RunGroups to specify metric_groups that override those of subgroups. This allows us to use the detailed version of a metric group without needing to specify scenario groups again.
  • Orthogonal: Added a subgroup_metric_groups_hidden list that allows us to hide some metric groups (e.g., since we run some ablations without perturbations, we hide Robustness there). See the sketch after this list for how these fields fit together.
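
A simplified sketch of the fields involved (the names match the PR, but this stripped-down class is only a stand-in for illustration, not the actual HELM RunGroup definition):

from dataclasses import dataclass, field
from typing import List

@dataclass
class RunGroup:
    name: str
    # Metric groups to display for this group; a "*_detailed" entry
    # (e.g., "calibration_detailed") is an expanded version of the basic group.
    metric_groups: List[str] = field(default_factory=list)
    # Subgroups whose metric groups are normally aggregated here; if this group
    # sets metric_groups, that list overrides the subgroups' choices.
    subgroups: List[str] = field(default_factory=list)
    # Metric groups coming from subgroups that should be hidden for this group
    # (e.g., hide "robustness" for ablations run without perturbations).
    subgroup_metric_groups_hidden: List[str] = field(default_factory=list)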

In terms of annotation in schema.yaml, I was not sure about the following (cc @AnanyaKumar @rishibommasani)

  • Which calibration metrics we want, see calibration_detailed
  • Which scenarios we want to look at, see the "Calibration" targeted evaluation

Resolves #1125.

# deduplicate, remove basic metric group if we include the detailed one, remove hidden metric groups
all_metric_groups = [
    metric_group
    for metric_group in dict.fromkeys(all_metric_groups)

Collaborator:

Just do set(all_metric_groups) instead of dict.fromkeys(all_metric_groups)

Collaborator (Author):

Oh, I'm using dict to preserve the order of the metric_groups. (For set the order is undefined)
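
For reference, a tiny standalone illustration of the difference (not part of the diff):

groups = ["accuracy", "calibration", "accuracy", "robustness"]

list(dict.fromkeys(groups))  # ['accuracy', 'calibration', 'robustness'] -- first-seen order kept
list(set(groups))            # duplicates removed too, but iteration order is arbitrary,
                             # so the rendered metric groups could come out shuffled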

    metric_group
    for metric_group in dict.fromkeys(all_metric_groups)
    if f"{metric_group}_detailed" not in all_metric_groups
    and metric_group not in group.subgroup_metric_groups_hidden

Collaborator:

Note that this has the (perhaps intended?) effect that if subgroup_metric_groups_hidden contains only robustness and the metric_groups contains robustness_detailed, then robustness_detailed will still be included.

Collaborator (Author):

Oh that's a good point that was definitely not intended!

But thinking a bit more about it, I think it actually makes sense to keep as-is?
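
A small self-contained sketch of the behavior discussed in this thread (the values are made up; the comprehension mirrors the one in the diff):

all_metric_groups = ["robustness", "robustness_detailed", "calibration"]
subgroup_metric_groups_hidden = ["robustness"]

filtered = [
    metric_group
    for metric_group in dict.fromkeys(all_metric_groups)
    if f"{metric_group}_detailed" not in all_metric_groups
    and metric_group not in subgroup_metric_groups_hidden
]
print(filtered)  # ['robustness_detailed', 'calibration']
# "robustness" is dropped (its detailed variant is present and it is hidden),
# but "robustness_detailed" itself survives -- the effect noted above.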

dtsip commented Dec 16, 2022

@AnanyaKumar @rishibommasani : for detailed calibration, what do we want to look at in terms of
a. calibration metrics
b. scenarios

dtsip commented Dec 16, 2022

Added all the calibration metrics and used IMDB/MMLU/RAFT/CivilComments as scenarios for now. Feel free to change.

dtsip merged commit 21ae634 into main on Dec 16, 2022
dtsip deleted the dtsip-calibration branch on December 16, 2022 04:54
@AnanyaKumar

Hey, I'm sorry for the late reply! Thank you for getting this done Dimitris!

Metrics: I think adding all the calibration metrics in detailed results sounds good.

Scenarios: I think pretty much any scenario should support calibration now! We support calibration for generation tasks as well.

@rishibommasani in the HELM paper we don't evaluate calibration on XSUM. Is xsum a generation problem with support for exact_match? If so, we can add calibration for xsum as well.
