
Documentation: Evaluation run lifecycle #2506

Open · yifanmai wants to merge 2 commits into main

Conversation

yifanmai (Collaborator)

No description provided.

@@ -0,0 +1,46 @@
# Evaluation Run Lifecycle

Each invocation of `helm-run` will run a number of **evaluation runs**. Each evaluation run uses a single scenario and a single model, and is performed independently form other evaluation runs. Evaluation runs are usually executed serially / one at a time by the default runner, though some alternate runners (e.g. `SlurmRunner`) may execute evaluation runs in parallel.

Contributor

form -> from
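
For intuition, here is a minimal sketch of what "independent evaluation runs executed serially or in parallel" means. This is purely illustrative Python, not HELM's actual `Runner` or `SlurmRunner` code, and the run names are just placeholders.

```python
# Illustrative only: evaluation runs are independent of each other, so a runner
# can execute them one at a time (the default) or hand them to parallel workers.
from concurrent.futures import ThreadPoolExecutor


def execute_one_run(run_name: str) -> str:
    """Stand-in for one evaluation run: a single scenario paired with a single model."""
    # ... get instances, adapt them into requests, query the model, compute stats ...
    return f"finished {run_name}"


def main() -> None:
    run_names = [
        "mmlu:subject=anatomy,model=openai/gpt2",
        "mmlu:subject=philosophy,model=openai/gpt2",
    ]

    # Default runner behavior: execute the runs serially, one at a time.
    for name in run_names:
        print(execute_one_run(name))

    # An alternate runner (e.g. SlurmRunner) may execute the same independent runs
    # in parallel; a thread pool stands in here for whatever parallel backend is used.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for result in pool.map(execute_one_run, run_names):
            print(result)


if __name__ == "__main__":
    main()
```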

4. Send the requests to the models and receives the request responses.
5. Compute the per-instance stats and aggregate them to per-run stats.

The following code and data objects are responsible involved in an evaluation run:

Contributor

Very helpful that the 5 points here map back to the previous 5


When a user runs `helm-run`, the evaluation runner will perform a number of evaluation runs, each specified by a `RunSpec`. However, the user typically does not provide the `RunSpec`s directly. Instead, the `RunSpec`s are produced by **run spec functions**. The user instead passes one or more **run entries** to `helm-run`, which are short strings (e.g. `mmlu:subject=anatomy,model=openai/gpt2`) that specify how to invoke the run spec functions to get the actual `RunSpec`s.

The run entry format is explained further on its own documentation.

Contributor

Would be useful to link to that here

Member

Yes, and then we can just say "further here"
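
As a rough illustration of the run entry / run spec function relationship described above, the sketch below parses a run entry string and dispatches it to a registered function that builds a `RunSpec`. The dataclasses, fields, and parsing are simplified stand-ins, not HELM's actual classes or entry grammar.

```python
# Simplified sketch: a run entry is a function name plus keyword arguments,
# and the run spec function turns those arguments into a concrete RunSpec.
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class ScenarioSpec:
    class_name: str
    args: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class RunSpec:
    name: str
    scenario_spec: ScenarioSpec
    model_deployment: str  # simplified: HELM records this inside the AdapterSpec


def mmlu_run_spec(subject: str, model: str) -> RunSpec:
    """A run spec function: builds a concrete RunSpec from keyword arguments."""
    return RunSpec(
        name=f"mmlu:subject={subject},model={model}",
        scenario_spec=ScenarioSpec("MMLUScenario", {"subject": subject}),
        model_deployment=model,
    )


RUN_SPEC_FUNCTIONS: Dict[str, Callable[..., RunSpec]] = {"mmlu": mmlu_run_spec}


def run_entry_to_run_spec(run_entry: str) -> RunSpec:
    """Parse 'mmlu:subject=anatomy,model=openai/gpt2' and call the matching function."""
    function_name, _, arg_string = run_entry.partition(":")
    kwargs = dict(pair.split("=", 1) for pair in arg_string.split(","))
    return RUN_SPEC_FUNCTIONS[function_name](**kwargs)


print(run_entry_to_run_spec("mmlu:subject=anatomy,model=openai/gpt2"))
```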

3. An `AdapterSpec` specifies an `Adapter` instance.
4. `MetricSpec`s specifies `Metric` instances.

Note: The `RunSpec` does not contain a `ClientSpec` specifies the `Client` instance. Instead, the `RunSpec` specifies the name of the model deployment inside `AdapterSpec`. During the evaluation run, the model deployment name is used to retreive the `ClientSpec` from built-in or user-provided model deployment configurations, which is then used to construct the `Client`. This late binding allows the HELM user to perform user-specific configuration of clients, such as changing the type or location of the model inference platform for the model.

Member

retreive -> retrieve
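
The late binding described in the note above can be sketched roughly as follows. All class names, fields, and the registry layout here are hypothetical simplifications, not HELM's real model deployment configuration.

```python
# Illustrative sketch: the RunSpec only carries a model deployment name, and the
# ClientSpec is looked up from a deployment configuration at run time.
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ClientSpec:
    class_name: str  # which client implementation to construct
    base_url: str    # where the inference platform is hosted


class EchoClient:
    """Toy client standing in for a real inference client."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    def make_request(self, prompt: str) -> str:
        return f"[{self.base_url}] completion for: {prompt}"


# Built-in or user-provided model deployment configurations. A user can point the
# same deployment name at a different platform or location without touching the RunSpec.
MODEL_DEPLOYMENTS: Dict[str, ClientSpec] = {
    "openai/gpt2": ClientSpec(class_name="EchoClient", base_url="http://localhost:8080"),
}

CLIENT_CLASSES = {"EchoClient": EchoClient}


def create_client(model_deployment_name: str) -> EchoClient:
    """Resolve the deployment name (taken from the AdapterSpec) into a Client instance."""
    client_spec = MODEL_DEPLOYMENTS[model_deployment_name]
    client_class = CLIENT_CLASSES[client_spec.class_name]
    return client_class(base_url=client_spec.base_url)


client = create_client("openai/gpt2")
print(client.make_request("Q: What is the capital of France?\nA:"))
```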


The following code and data objects are responsible involved in an evaluation run:

1. A `Scenario` provides the in context learning and evaluation `Instance`s.

Member

nit: in-context

2. A `DataAugmenter` takes in base `Instance` and generates perturbed `Instance`.
3. A `Adapter` transforms in-context learning instances and evaluation instances into model inference `Request`s.
4. A `Client` sends the `Requests` to the models and receives `RequestResponse`s.
5. `Metrics`s take in `RequestState`s (which each contain a `Instance`, `Request`,`RequestResponse`, and additional instance context) and compute aggregated adn per-instanace `Stat`s.

Member

typo: "and" I think (towards the end of the sentence).
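
To make the relationships between these data objects concrete, here is a minimal sketch with simplified, hypothetical fields; HELM's real classes carry more information.

```python
# Illustrative data objects only, annotated with which code object produces each one.
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Instance:
    input: str                                            # produced by the Scenario (or a DataAugmenter)
    references: List[str] = field(default_factory=list)   # e.g. multiple choice options


@dataclass(frozen=True)
class Request:
    prompt: str                                            # produced by the Adapter
    temperature: float = 0.0
    stop_sequences: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class RequestResponse:
    completion: str                                        # returned by the Client


@dataclass(frozen=True)
class RequestState:
    instance: Instance                                     # ties together what the Metrics need
    request: Request
    response: RequestResponse


@dataclass
class Stat:
    name: str                                              # produced by a Metric, per instance or aggregated
    value: float


state = RequestState(
    instance=Instance(input="What is 2 + 2?", references=["3", "4"]),
    request=Request(prompt="Q: What is 2 + 2?\nA:"),
    response=RequestResponse(completion="4"),
)
print(state)
```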

1. Get in-context learning and evaluation instances from scenario. Each instance has an input (e.g. question) and a set of reference outputs (e.g. multiple choice options).
2. (Advanced) Run _data augmenters / perturbations_ on the base instances to generate perturbed instances.
3. Perform _adaptation_ to transform the in-context learning instances and evaluation instances into model inference requests, which contain prompts and other request parameters such as request temperature and stop sequences.
4. Send the requests to the models and receives the request responses.

Member

receives should be receive
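
Step 3 (adaptation) in the list quoted above can be sketched as follows; the prompt format and request parameters are illustrative choices, not HELM's defaults.

```python
# Illustrative sketch of adaptation: in-context learning examples plus one
# evaluation question become a prompt with decoding parameters attached.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Request:
    prompt: str
    temperature: float
    stop_sequences: List[str]


def adapt(in_context_examples: List[Tuple[str, str]], eval_question: str) -> Request:
    """Build a few-shot prompt from (question, answer) pairs plus one evaluation question."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in in_context_examples)
    prompt = f"{shots}\n\nQuestion: {eval_question}\nAnswer:"
    # Request parameters such as temperature and stop sequences travel with the prompt.
    return Request(prompt=prompt, temperature=0.0, stop_sequences=["\n"])


request = adapt(
    in_context_examples=[("What is 2 + 2?", "4")],
    eval_question="What is 3 + 5?",
)
print(request.prompt)
```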

4. Send the requests to the models and receives the request responses.
5. Compute the per-instance stats and aggregate them to per-run stats.

The following code and data objects are responsible involved in an evaluation run:

Member

responsible involved in should either be responsible for or just involved in


1. Specifications (`RunSpec`, `ScenarioSpec`, `DataAugmenterSpec`, `AdapterSpec`, `ClientSpec`, and `MetricsSpec`) are serializable. They may be written to evaluation run output files, to provide a record of how the evaluation run was configured and how to reproduce it.
2. Code objects (`Scenario`, `DataAugmenter`, `Adapter`, `Client`, `Metric`) are _not_ serializable. These contain program logic used for by the evlauation run. Users can implement custom subclasses of these objects if needed.
3. Data objects (`Instance`, `Request`, `Response`, `Stat`) are serializable. These are typcically produced as outputs of code objects and written to the evaluation run output files.

Member

typo: typically
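
The serializable versus non-serializable split described above can be illustrated with simplified stand-in classes (not HELM's real ones): specification and data objects are plain dataclasses that can be written to the run output files as JSON, while code objects hold program logic and are only reconstructed from their specifications.

```python
# Illustrative only: a serializable spec versus the code object it describes.
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class ScenarioSpec:          # specification object: serializable
    class_name: str
    args: Dict[str, str]


class MMLUScenario:          # code object: not serialized, only its spec is recorded
    def __init__(self, subject: str) -> None:
        self.subject = subject

    def get_instances(self):
        return [f"placeholder {self.subject} question"]


spec = ScenarioSpec(class_name="MMLUScenario", args={"subject": "anatomy"})

# The spec can be written to the run output as a record of the configuration...
print(json.dumps(asdict(spec), indent=2))

# ...while the code object is rebuilt from the spec when the run executes.
scenario = MMLUScenario(**spec.args)
print(scenario.get_instances())
```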


The objects above can be grouped into three categories:

1. Specifications (`RunSpec`, `ScenarioSpec`, `DataAugmenterSpec`, `AdapterSpec`, `ClientSpec`, and `MetricsSpec`) are serializable. They may be written to evaluation run output files, to provide a record of how the evaluation run was configured and how to reproduce it.

Member

Don't need the comma after run output files
