
Disable Materializing Grads #6822

Merged: 11 commits into thiagofc/ortmodule-api on Mar 8, 2021

Conversation

centwang (Contributor) commented Feb 26, 2021

Disable materializing grads in forward so that a None gradient is not converted to a tensor filled with zeros before backward is called; we materialize the output grads ourselves. If an output grad is None when backward is called and it is a direct input of the backward graph, we materialize it as an all-zero tensor with the same shape as the output. Otherwise we use a scalar-zero tensor, because in that case an Add node has been inserted, and a scalar-zero tensor is always acceptable as one of the inputs of the Add node.
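For context, here is a minimal toy sketch of the PyTorch mechanism this relies on (not ORTModule's actual implementation): with `set_materialize_grads(False)`, the grads of unused outputs arrive in `backward` as None, and the function decides how to materialize them.

```python
import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Without this call, autograd converts a None output grad into
        # an all-zero tensor before backward is invoked.
        ctx.set_materialize_grads(False)
        return x * 2, x * 3

    @staticmethod
    def backward(ctx, grad_a, grad_b):
        # Unused outputs arrive as None; a scalar zero broadcasts
        # safely as one operand of the additions below.
        grad = torch.zeros(())
        if grad_a is not None:
            grad = grad + 2 * grad_a
        if grad_b is not None:
            grad = grad + 3 * grad_b
        return grad

x = torch.ones(3, requires_grad=True)
a, b = TwoOutputs.apply(x)
a.sum().backward()   # grad_b arrives as None, not as all zeros
print(x.grad)        # tensor([2., 2., 2.])
```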

This PR also contains a fix for the module gradient builder. With the fix, we no longer ignore any output grads from PyTorch: if an output grad is also the output of a node in the graph, we add an Add node to sum that node output with the grad tensor from PyTorch.

Take BERT-large as an example: change the user outputs to return all of loss, prediction_scores and seq_relationship_score, but call backward from loss only. Without set_materialize_grads(False) the max batch size is 80 (on a 32GB V100); with the flag set to False the max batch size reaches 88. The Python debugger also confirms that with the flag set to False, the output grads for prediction_scores and seq_relationship_score are None instead of all-zero tensors.
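Roughly, the measurement scenario looks like this (the model and input names below are placeholders, not the actual test code):

```python
# Hypothetical driver; the real BERT-large inputs are elided here.
loss, prediction_scores, seq_relationship_score = ort_model(
    input_ids, attention_mask, token_type_ids)
loss.backward()  # the grads of the other two outputs stay None
```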

SherlockNoMad (Contributor):

It's great that this worked!

SherlockNoMad previously approved these changes Feb 26, 2021
mrry (Contributor) commented Feb 27, 2021

This is a really nice saving, but can you explain what happens in the following corner case, starting from your example:

Take BERT-large as an example: change the user outputs to return all of loss, prediction_scores and seq_relationship_score

Now we call backward() on a tensor that's derived from loss and prediction_scores, e.g. with this silly example:

loss, prediction_scores, _ = model.forward()  # model inputs omitted for brevity
new_loss = loss + prediction_scores.sum()
new_loss.backward()

Do we do the right thing in this case? It doesn't look like this should change the control flow through ORTModule compared to calling loss.backward(), so I'd be surprised if it worked without more changes....

centwang (Contributor, author) commented Mar 1, 2021

@mrry Thanks for your example. For this case, both loss_grad and prediction_scores_grad will be all-one tensors, and seq_relationship_score_grad will be None from PyTorch. Setting the flag to False keeps seq_relationship_score_grad as None instead of changing it to an all-zero tensor, and that part works, since the current implementation needs loss_grad only.
However, this reveals another big issue in our gradient graph builder: the current implementation is wrong for this case. Since prediction_scores_grad is an output of a node in the graph, the graph builder ignores the all-one prediction_scores_grad from PyTorch. The right behavior is to add a Sum node that sums the node output and the all-one tensor from PyTorch to produce the final prediction_scores_grad.
I modified the graph builder to add the Sum node, but there seems to be another issue during Yield op execution, so the result is still not correct. I will keep debugging this.
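A plain-PyTorch sketch (toy numbers, not the BERT graph) of why the Sum is needed: when an output both feeds the loss internally and receives its own incoming grad, the two gradient contributions must be accumulated rather than one of them dropped.

```python
import torch

x = torch.ones(2, requires_grad=True)
prediction_scores = x * 3            # stand-in for a graph-internal output
loss = prediction_scores.sum() * 5   # the loss also depends on it
new_loss = loss + prediction_scores.sum()
new_loss.backward()
# Through loss: 3 * 5 = 15; direct path: 3. Total 18 per element.
print(x.grad)  # tensor([18., 18.])
```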

mrry (Contributor) commented Mar 1, 2021

Thanks for digging into this @iK1D! There are probably some interesting complexity-efficiency tradeoffs to be made here, for example:

  1. (Easiest, least efficient) We could be pessimistic and always assume that any of the forward outputs could have a gradient, but I think that would require us to materialize the grads in every case, and thus we'd waste memory and compute.
  2. (Harder, more efficient) Since the set of forward outputs that have gradients is probably pretty stable from one run to the next, we could optimistically specialize the graph for the common case, and provide some way to fall back to (1) if we get unexpected gradients.
  3. (Hardest, most(?) efficient, and probably needs more thought!) We could disable materializing grads, as you have in this PR, and put some If nodes into the graph to deal with the None case. For example, each backward input could become a tuple of (bool, value), or some other representation that captures the fact that it is "optional". Then we could wrap the backward computation for each input in the true branch of an If block, with a passthrough in the false branch. Depending on how efficient the control-flow implementation is, we might still want to specialize as in (2). (A rough sketch of this optional encoding follows the list.)
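A rough Python sketch of the optional encoding in (3), purely illustrative (the names are made up):

```python
from typing import NamedTuple
import torch

class OptionalGrad(NamedTuple):
    """Hypothetical (bool, value) pair for an optional backward input."""
    present: bool
    value: torch.Tensor  # ignored when present is False

def consume(opt: OptionalGrad, passthrough: torch.Tensor) -> torch.Tensor:
    # Mirrors the proposed If node: the true branch consumes the grad,
    # the false branch is a passthrough.
    return passthrough + opt.value if opt.present else passthrough
```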

I'd prefer us to implement (1) at first, because we can always specialize manually by wrapping the original module in another module that drops the irrelevant outputs (sketched below).

There are probably other approaches that could work here as well, but I wanted to share a brain dump....
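The manual specialization mentioned above could look something like this hypothetical wrapper (names invented for illustration):

```python
import torch

class LossOnly(torch.nn.Module):
    """Drops outputs we never backprop through, so only one grad input
    ever reaches the backward graph."""
    def __init__(self, inner: torch.nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, *args, **kwargs):
        loss, _scores, _relationship = self.inner(*args, **kwargs)
        return loss
```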

SherlockNoMad (Contributor):

@mrry, the auxiliary loss in the MoE model is actually constructed in a similar way to the one you showed above, so this is not an uncommon case. Sample here: https://aiinfra.visualstudio.com/Lotus/_git/MOE?path=%2Fmoe_module%2Fmoe.py&version=GBmain&line=33&lineEnd=34&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents

@iK1D, indeed there is a bug in how we currently construct the gradient graph.

Since we can't tell which module outputs would end up getting the gradient, I think we have two options:

  1. Always assume all outputs have gradients, and trim the graph on the first backward() call.
  2. Defer building the training graph until the backward() call.

+@satyajandhyala for more perspectives, since he has also looked at this problem before.

centwang (Contributor, author) commented Mar 2, 2021

I am thinking of Derek's option (1), but with a trick. We still set materialize_grads to False; if a grad is None, we create a scalar-zero tensor as its grad and pass it to ORT. If we can somehow make SetOutputMLValue work (the shape is inconsistent), I think the Sum is quick, as we have a scalar optimization for the calculation.
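The trick works because a 0-dim zero tensor broadcasts against any shape, so the inserted Sum/Add stays valid even though the stand-in grad is not full-shape. In plain PyTorch terms (a sketch, not ORT code):

```python
import torch

scalar_zero = torch.zeros(())          # stand-in for a None grad
internal_grad = torch.randn(4, 128)    # grad produced inside the graph
total = internal_grad + scalar_zero    # broadcasts; equals internal_grad
assert torch.equal(total, internal_grad)
```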

centwang requested a review from a team as a code owner on March 2, 2021 09:34
centwang dismissed SherlockNoMad's stale review on March 2, 2021 09:41:

New changes need to be reviewed again.

if (!hasShape(*typeProto)) {
continue;
}
propagateShapeFromInputToOutput(ctx, j, i);
propagateShapeFromInputToOutput(ctx, i, i);
Contributor: The shape can be different, right? For the scalar 0?

Contributor (author): Scalar-0 is just a workaround during execution; logically the shape is the same. Setting it here helps infer other shape info in the backward graph.

Contributor (author): Aishwarya found an issue in T5 where mem_plan reuses this scalar tensor. I think mem_plan has a bug, but for now I will not set the shape for outputs that might be scalar. Will check the mem_plan issue later.

loss.backward()

N, D_in, H, D_out = 32, 784, 500, 10
pt_model = NeuralNetMultiplePositionalArgumentsMultipleOutputs1(D_in, H, D_out).to(device)
Contributor: Nit: a better name would be ...MultiOutputsWithDependency.

Contributor: The same applies to NeuralNetMultiplePositionalArgumentsMultipleOutputs0: ...MultiOutputsWithoutDependency.

SherlockNoMad (Contributor) left a comment: Overall, it looks good to me. Thanks a lot for the fix!

SherlockNoMad (Contributor) commented Mar 4, 2021

With this PR, we can rely on two assumptions that reduce the complexity:

  1. Yield's inputs and outputs always have a 1-to-1 mapping.
  2. We no longer need to reorder the graph outputs, so Yield's input order should exactly match the module's forward output order, and Yield's output order should match the module's backward input order.

@@ -156,7 +158,10 @@ def assert_gradients_match_and_reset_gradient(ort_model, pt_model, reset_gradien
pt_name, pt_param = pt_named_param

assert pt_name in ort_name
assert torch.allclose(ort_param.grad, pt_param.grad, rtol=rtol, atol=atol)
if pt_name in none_pt_params:
    assert pt_param.grad is None
Contributor: Nit: shall we also assert that ort_param.grad is all zeros?

run_step0(pt_model, pt_x1, pt_x2)
run_step0(ort_model, ort_x1, ort_x2)

assert torch.allclose(ort_x1.grad, pt_x1.grad)
Contributor: Let's also assert on the forward results? I caught a mismatch in the forward results in test_input_requires_grad_backward_creates_input_grad_as_required1; see
https://msdata.visualstudio.com/Vienna/_sprints/taskboard/ONNX%20Training/Vienna/Cobalt?workitem=1064208

run_step(pt_model, pt_x1, pt_x2)
run_step(ort_model, ort_x1, ort_x2)

_test_helpers.assert_gradients_match_and_reset_gradient(ort_model, pt_model)
Contributor: Also assert the forward results?

"required_grad",
"The indices of the outputs that require gradient outputs.",
AttributeProto::INTS)
.Attr("full_shape_outputs", "The indices of the outputs that must have full shape.", AttributeProto::INTS)
Contributor: Also add an assert in Yield's kernel? The output shape should match the input shape if "full_shape" is true.

SherlockNoMad (Contributor) left a comment: The remaining comments are non-blocking; they are mostly defensive-programming suggestions.

centwang merged commit 56c5620 into thiagofc/ortmodule-api on Mar 8, 2021.
centwang deleted the weicwang/disable_materialize_grads branch on March 8, 2021 08:56.