
feat: add custom reducers to estimators [DET-3098] #837

Merged

Conversation


@rb-determined-ai rb-determined-ai commented Jul 7, 2020

Description

Introduce custom reducers for estimator trials. From the docstring in the PR:

During distributed evaluation, many types of metrics calculated via ``tf.metrics`` or
``tf.keras.metrics`` cannot be aggregated properly from the per-slot final metrics
calculated by each separate Estimator replica. One example is ``tf.metrics.auc``, where
the ROC AUC calculated over predictions and labels from a full dataset cannot be derived
from a set of ROC AUC metrics evaluated over the shards of a dataset. However, with
``make_metric``, the ROC AUC could be calculated in distributed training by calling
``sklearn.metrics.roc_auc_score`` in a custom ``reducer`` function.
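The non-decomposability the docstring describes can be seen with a tiny pure-Python AUC (the pairwise Mann-Whitney formulation; `roc_auc` here is an illustrative stand-in for `sklearn.metrics.roc_auc_score`, not part of the PR):

```python
def roc_auc(labels, scores):
    # Probability that a random positive is scored above a random negative
    # (ties count as half) -- the Mann-Whitney formulation of ROC AUC.
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two evaluation shards, as if each Estimator replica saw half the dataset.
shard_a = ([0, 1], [0.1, 0.9])   # per-shard AUC: 1.0
shard_b = ([0, 1], [0.6, 0.4])   # per-shard AUC: 0.0

# Averaging per-slot AUCs gives 0.5, but the AUC over the full dataset differs:
full_labels = shard_a[0] + shard_b[0]
full_scores = shard_a[1] + shard_b[1]
full_auc = roc_auc(full_labels, full_scores)                   # 0.75
mean_of_shards = (roc_auc(*shard_a) + roc_auc(*shard_b)) / 2   # 0.5
```

Hence a custom `reducer` that sees the gathered predictions and labels, rather than per-slot AUCs, is required.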

Test Plan

Lots of manual testing, in addition to adding a new unit test and a new parallel test.

@rb-determined-ai rb-determined-ai force-pushed the custom-reducer branch 2 times, most recently from 092018c to 8025242 Compare July 15, 2020 20:56
@rb-determined-ai rb-determined-ai marked this pull request as ready for review July 15, 2020 20:58
@rb-determined-ai (Member Author) commented:

@aaron276h This is now ready for a "for-realsies" review. The incremental update since the last time you looked is:

  • added unit and parallel e2e test
  • fixed a bug where the tf.control_dependencies() was somehow triggering the call to _DistributedMetric.result() twice, resulting in building two allgather ops for every metric in the graph. I fixed this by pre-building every result op and returning the cached op when result() is called.
  • the _allgather_ops list in the context has to be reset after every .evaluate() call, because every .evaluate() is going to build a clean graph, and you don't want to call tf.control_dependencies() on ops from the old graph.
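The caching fix in the second bullet can be sketched in plain Python (names here are hypothetical; the real change lives in the PR's `_DistributedMetric`):

```python
class DistributedMetricSketch:
    """Illustration of the fix: pre-build the result op once and always
    return the cached op, so tf.control_dependencies() wrapping result()
    cannot trigger a second build of the allgather op."""

    def __init__(self, build_result_op):
        self._build_result_op = build_result_op  # expensive: builds allgather ops
        self._cached_op = None

    def result(self):
        if self._cached_op is None:
            self._cached_op = self._build_result_op()
        return self._cached_op

builds = []
metric = DistributedMetricSketch(lambda: builds.append("allgather") or "result_op")
first, second = metric.result(), metric.result()
# Both calls return the same op; the allgather op was built only once.
```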

return sum(per_slot_metrics)


class EstimatorDebugTrial(estimator.EstimatorTrial):
Contributor:

non-blocking: can we rename this from Debug to something else

Member Author:

sure, done

Review threads on:

  • harness/determined/estimator/_estimator_context.py (outdated)
  • harness/determined/estimator/_reducer.py
  • e2e_tests/tests/experiment/test_tf_estimator.py (outdated)
  • e2e_tests/tests/experiment/test_tf_estimator.py (outdated)
Reducing Metrics
~~~~~~~~~~~~~~~~

Determined supports proper reduction of arbitrary metrics during distributed
Contributor:

I think worth calling out that this is for validation metrics

Member Author:

I made the following edits to the docstrings:

 Reducing Metrics
 ~~~~~~~~~~~~~~~~

-Determined supports proper reduction of arbitrary metrics during distributed
-training by allowing users to define custom reducers for their metrics. Custom
-reducers can be either a function or an implementation of the
+Determined supports proper reduction of arbitrary validation metrics during
+distributed training by allowing users to define custom reducers for their
+metrics. Custom reducers can be either a function or an implementation of the
     def make_metric(..) -> tf.keras.metrics.Metric:
         """
-        Return an estimator-compatible metric which will be calculated properly, even during
-        distributed training.
+        Return an estimator-compatible validation metric which will be calculated properly, even
+        during distributed evaluation.
 class MetricReducer:
     """
-    Efficiently aggregating metrics across a multi-slot distributed evaluation is done in two steps:
+    Efficiently aggregating validation metrics across a multi-slot distributed evaluation is done
+    in two steps:
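The two-step reduction named in that docstring can be sketched as follows (the method names `accumulate` and `cross_slot_reduce` come from the PR discussion; the distributed plumbing is simulated, not real):

```python
class MetricReducerSketch:
    """Step 1: each slot accumulates over its shard of the validation set.
    Step 2: the per-slot results are allgathered and reduced once, cross-slot."""

    def accumulate(self, value):
        raise NotImplementedError

    def cross_slot_reduce(self, per_slot_metrics):
        raise NotImplementedError

class SumReducer(MetricReducerSketch):
    def __init__(self):
        self.total = 0

    def accumulate(self, value):
        self.total += value
        return self.total  # the per-slot result that gets allgathered

    def cross_slot_reduce(self, per_slot_metrics):
        return sum(per_slot_metrics)

# Simulate two slots, each seeing half the batches:
slot_a, slot_b = SumReducer(), SumReducer()
for v in (1, 2):
    slot_a.accumulate(v)
for v in (3, 4):
    slot_b.accumulate(v)
final = slot_a.cross_slot_reduce([slot_a.total, slot_b.total])  # 3 + 7 = 10
```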


self.update_state(metric)

@self._det_context._build_allgather_op
Contributor:

question: why make this a decorator and not just a function call?

Member Author:

To keep the MetricReducer a pure python API.

A natural TensorFlow-y way to write this would be to set the granularity of py_func such that the final metric reduction is done in two steps: 1. allgather the final outputs of accumulate(), 2. apply the user's cross_slot_reduce to the allgathered results.

That would be natural because the only parts of the graph which have to be serialized are as small as possible: only step 1, the network communication. Also, the allgather op would just do allgather, and you could easily have a function to build a generic allgather that other ops would connect to. The drawback is that you would have to convert the output of the accumulate() function to TensorFlow types, since those outputs would have to pass through the graph. The output of the final allgather call would also have to have a declared dtype, since that is a requirement of py_func, which adds another layer of configurability we would need in the interface.

What I did was set the granularity of the py_func such that both of the above steps are accomplished within a single py_func. The input to cross_slot_reduce() is then much, much easier for the user to reason about, since they get their exact outputs rather than TensorFlow-cast outputs. The cost is that the entire cross_slot_reduce becomes part of the serialized section of graph operations.

Given the py_func granularity I chose, I think a decorator is the best way to parameterize _build_allgather_op, but it's definitely a little confusing.
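The chosen granularity can be sketched without TensorFlow (names hypothetical; in the real code the closure body below is what runs inside a single py_func):

```python
def build_single_pyfunc_reduce(accumulate_result, allgather, cross_slot_reduce):
    """The chosen granularity: one serialized step performs both the
    allgather (network communication) and the user's cross_slot_reduce,
    so the user sees their exact Python outputs, never TF-cast tensors."""
    def op():
        per_slot = allgather(accumulate_result())  # step 1: gather per-slot results
        return cross_slot_reduce(per_slot)         # step 2: user reduction, same step
    return op

# Fake a 2-slot allgather for illustration:
fake_allgather = lambda local: [local, local + 10]
op = build_single_pyfunc_reduce(lambda: 5, fake_allgather, sum)
# op() -> sum([5, 15]) == 20
```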

Member Author:

Talked offline, the question was about calling _build_allgather_op directly rather than applying it as a decorator. I don't care either way, so I took the direct call approach.

@rb-determined-ai rb-determined-ai changed the title feat: add custom reducers to estimators feat: add custom reducers to estimators [DET-3098] Jul 20, 2020
@rb-determined-ai rb-determined-ai merged commit fad06e9 into determined-ai:master Jul 20, 2020
@rb-determined-ai rb-determined-ai deleted the custom-reducer branch July 20, 2020 22:16
rb-determined-ai added a commit that referenced this pull request Jul 20, 2020
rb-determined-ai added a commit that referenced this pull request Jul 20, 2020
@dannysauer dannysauer added this to the 0.12.12 milestone Feb 6, 2024