Which evaluation infra was used for benchmarking? #17

Open
woffett opened this issue Jul 24, 2023 · 1 comment

Comments


woffett commented Jul 24, 2023

A previous GH issue (here) mentions that a modified version of this script (here) was used to collect the MMLU numbers. What about the scripts for the other benchmarks in the blog post? As HuggingFace notes here, numbers can vary wildly across evaluation codebases, so it would be useful to know whether HELM, EleutherAI's harness, or an internal benchmarking library was used for HellaSwag, Winogrande, etc. The same applies to the long-sequence tasks (e.g. AMI, FD, SCROLLS). Thanks!

@tianxie-9

We used lm-evaluation-harness for the LM harness scores. For the long-sequence tasks, we used our internal code.
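
For reference, a minimal sketch of how such scores might be reproduced with EleutherAI's lm-evaluation-harness, assuming the v0.3.x `simple_evaluate` API. The checkpoint name, task list, shot count, and batch size below are placeholders for illustration, not the exact settings behind the blog post numbers.

```python
# Hypothetical sketch, not the authors' exact setup:
# scoring HellaSwag and Winogrande with lm-evaluation-harness v0.3.x.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                               # HuggingFace causal-LM adapter
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint, swap in the model under test
    tasks=["hellaswag", "winogrande"],               # benchmarks asked about above
    num_fewshot=0,                                   # shot counts in the blog post may differ
    batch_size=8,
)

# Print per-task metrics (acc / acc_norm) as a table.
print(evaluator.make_table(results))
```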
