Add detailed calibration metrics #1247
Conversation
# deduplicate, remove basic metric group if we include the detailed one, remove hidden metric groups
all_metric_groups = [
    metric_group
    for metric_group in dict.fromkeys(all_metric_groups)
Just do `set(all_metric_groups)` instead of `dict.fromkeys(all_metric_groups)`.
Oh, I'm using `dict` to preserve the order of the `metric_groups`. (For `set` the order is undefined.)
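For illustration, a minimal runnable sketch of the difference (the group names here are hypothetical):

```python
metric_groups = ["accuracy", "calibration", "accuracy", "robustness", "calibration"]

# dict.fromkeys deduplicates while keeping the first occurrence of each key,
# since dicts preserve insertion order (guaranteed since Python 3.7).
print(list(dict.fromkeys(metric_groups)))
# ['accuracy', 'calibration', 'robustness']

# set also deduplicates, but its iteration order is arbitrary, so the
# displayed order of metric groups could differ from run to run.
print(list(set(metric_groups)))
# e.g. ['robustness', 'accuracy', 'calibration']
```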
    metric_group
    for metric_group in dict.fromkeys(all_metric_groups)
    if f"{metric_group}_detailed" not in all_metric_groups
    and metric_group not in group.subgroup_metric_groups_hidden
Note that this has the (perhaps intended?) effect that if `subgroup_metric_groups_hidden` contains only `robustness` and the `metric_groups` contains `robustness_detailed`, then `robustness_detailed` will still be included.
Oh, that's a good point; that was definitely not intended! But thinking a bit more about it, I think it actually makes sense to keep as-is?
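To make the behavior under discussion concrete, here is a self-contained sketch of the quoted comprehension with a hypothetical hidden list; hiding `robustness` does not hide `robustness_detailed`:

```python
all_metric_groups = ["robustness", "robustness_detailed", "accuracy"]
subgroup_metric_groups_hidden = ["robustness"]  # hides only the basic group

filtered = [
    metric_group
    for metric_group in dict.fromkeys(all_metric_groups)
    # drop the basic group when its detailed variant is present
    if f"{metric_group}_detailed" not in all_metric_groups
    # drop explicitly hidden groups
    and metric_group not in subgroup_metric_groups_hidden
]
print(filtered)  # ['robustness_detailed', 'accuracy']
```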
@AnanyaKumar @rishibommasani: for detailed calibration, what do we want to look at in terms of metrics and scenarios?
Added all the calibration metrics and used IMDB/MMLU/RAFT/CivilComments as scenarios for now. Feel free to change.
Hey, I'm sorry for the late reply! Thank you for getting this done, Dimitris!

Metrics: I think adding all the calibration metrics in detailed results sounds good.

Scenarios: I think pretty much any scenario should support calibration now! We support calibration for generation tasks as well. @rishibommasani: in the HELM paper we don't evaluate calibration on XSUM. Is XSUM a generation problem with support for exact_match? In that case we can add calibration for XSUM as well.
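For context on what such a calibration metric computes, here is a minimal sketch of a 10-bin expected calibration error; this illustrates the general idea only and is not HELM's implementation:

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, correct: np.ndarray, num_bins: int = 10) -> float:
    """Bin predictions by confidence, then average |accuracy - confidence|
    over the bins, weighted by the fraction of examples in each bin."""
    bin_ids = np.minimum((probs * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - probs[mask].mean())
    return float(ece)

# Three predictions: model confidence and whether each answer was correct.
probs = np.array([0.9, 0.6, 0.8])
correct = np.array([1.0, 0.0, 1.0])
print(expected_calibration_error(probs, correct))  # 0.3
```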
Added a "targeted evaluation" for calibration where we display additional metrics for some scenarios.

In terms of infrastructure:
- Added `_detailed` metric groups (e.g., `calibration_detailed`).
- Allowed `RunGroup`s to specify `metric_groups` that override those of subgroups. This allows us to use the detailed version of a metric group without needing to specify scenario groups again.
- Added a `subgroup_metric_groups_hidden` list that allows us to hide some metric groups (e.g., since we run some ablations without perturbations, we hide Robustness there).

In terms of annotation in `schema.yaml`, I was not sure about the following (cc @AnanyaKumar @rishibommasani):
- `calibration_detailed`

Resolves #1125.
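For readers skimming the diff, a simplified sketch of how the override plus hiding could resolve together; the dataclass below is a hypothetical stand-in, not HELM's actual `RunGroup`:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RunGroup:  # hypothetical, heavily simplified stand-in
    name: str
    metric_groups: List[str] = field(default_factory=list)
    subgroups: List["RunGroup"] = field(default_factory=list)
    subgroup_metric_groups_hidden: List[str] = field(default_factory=list)

def resolve_metric_groups(group: RunGroup) -> List[str]:
    all_metric_groups: List[str] = []
    for subgroup in group.subgroups:
        # the parent's metric_groups, if set, override the subgroup's
        all_metric_groups.extend(group.metric_groups or subgroup.metric_groups)
    # deduplicate, drop basic groups that have a detailed variant, drop hidden groups
    return [
        metric_group
        for metric_group in dict.fromkeys(all_metric_groups)
        if f"{metric_group}_detailed" not in all_metric_groups
        and metric_group not in group.subgroup_metric_groups_hidden
    ]

imdb = RunGroup(name="imdb", metric_groups=["accuracy", "calibration"])
targeted = RunGroup(
    name="targeted_calibration",
    metric_groups=["accuracy", "calibration", "calibration_detailed"],
    subgroups=[imdb],
)
print(resolve_metric_groups(targeted))  # ['accuracy', 'calibration_detailed']
```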