Fix zero gradients for ppo-clipped vf #22171
Conversation
Here are some preliminary results using the tuned cartpole example, comparing the mean value error of the old and new value loss (plot omitted). Cartpole is a simple environment that can be solved using vanilla policy gradient, so the value function has little effect on final reward. I suspect more challenging environments would see a significant reward disparity between the old and new value functions.
wow, this fixes the long-standing value loss calculation issue?
thanks so much man!
Is this the same case on the tf branch? Would be good to apply the fix there as well.
TensorFlow does indeed do the weird clipping as well: see ray/rllib/agents/ppo/ppo_tf_policy.py, lines 115 to 129 at 7f1bacc.
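For reference, the pattern being discussed looks roughly like the sketch below. This is a paraphrase, not the exact source; the function name is made up, the variable names only approximate the RLlib ones, and the valid-mask reduction and config lookups are omitted.

```python
import tensorflow as tf

def old_clipped_vf_loss(value_fn_out, prev_value_fn_out, value_targets, vf_clip_param):
    # Unclipped squared error of the current value prediction.
    vf_loss1 = tf.math.square(value_fn_out - value_targets)
    # Clip the *change* in the prediction relative to the prediction stored in the
    # SampleBatch; prev_value_fn_out carries no gradient.
    vf_clipped = prev_value_fn_out + tf.clip_by_value(
        value_fn_out - prev_value_fn_out, -vf_clip_param, vf_clip_param)
    vf_loss2 = tf.math.square(vf_clipped - value_targets)
    # Elementwise max: wherever vf_loss2 wins and the clip is saturated,
    # the gradient w.r.t. the value network is zero.
    return tf.maximum(vf_loss1, vf_loss2)
```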
We'll need to make this change uniformly to both frameworks. Can you submit the change for tf as well? Thx 😊
@sven1977 I think it's worth revisiting the default value-clipping setting as well. A nicer idea would be to clip the value function gradient instead of the loss, or clip the value function error before taking the mean (this would produce zero gradients for individual predictions rather than for the entire train batch). But we can discuss this another time.
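For the sake of discussion, one reading of the second alternative (clip each value error before averaging) might look like the sketch below. This is only an illustration of the idea, not the change made in this PR; the function name and the choice to clamp the squared error are assumptions.

```python
import torch

def per_element_clipped_vf_loss(value_fn_out, value_targets, vf_clip_param):
    # Squared error per prediction.
    vf_err2 = torch.pow(value_fn_out - value_targets, 2.0)
    # Clamp each element's error: only saturated elements get a zero gradient,
    # instead of zeroing out the gradient for the whole train batch.
    return torch.clamp(vf_err2, 0.0, vf_clip_param).mean()
```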
Awesome! Thanks for the fixes, @smorad!
Why are these changes needed?
The PPO value loss calculation returns a zero gradient when clipping is applied and vf_loss2 is selected, because prev_value_fn_out comes from the SampleBatch, which doesn't track gradients. Furthermore, the logic itself is a bit convoluted. See the related issue for a more in-depth description.
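To make the failure mode concrete, here is a minimal, self-contained sketch (made-up numbers; variable names follow the description above) showing the clipped branch producing a zero gradient for the value network:

```python
import torch

# Made-up stand-ins for what RLlib reads from the train batch.
value_targets = torch.tensor([5.0, 5.0, 5.0])
prev_value_fn_out = torch.tensor([0.0, 0.0, 0.0])  # from the SampleBatch: no gradient
vf_clip_param = 0.5

# Stand-in for the trainable value head's current output.
value_fn_out = torch.tensor([4.9, 4.9, 4.9], requires_grad=True)

vf_loss1 = (value_fn_out - value_targets) ** 2
vf_clipped = prev_value_fn_out + torch.clamp(
    value_fn_out - prev_value_fn_out, -vf_clip_param, vf_clip_param)
vf_loss2 = (vf_clipped - value_targets) ** 2

# The clipped branch wins the elementwise max and the clamp is saturated,
# so no gradient reaches the value network for this batch.
torch.max(vf_loss1, vf_loss2).mean().backward()
print(value_fn_out.grad)  # tensor([0., 0., 0.])
```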
Related issue number
Closes #19291
Checks
I've run scripts/format.sh to lint the changes in this PR.