Thank you for this great library! I have a question about the handling of timeouts when computing Generalized Advantage Estimation (GAE), specifically in the following line:

`rsl_rl/rsl_rl/algorithms/ppo.py`, line 346 (commit `96393c4`)
If my understanding is correct, when a trajectory ends in a terminal state (i.e. a failure state such as the robot falling), that state is treated as absorbing, and the TD error is simply `reward - value`. If the trajectory is instead truncated because the episode timed out, the agent still needs to reason about the long-term value from the next state. In the line above, however, the rewards are simply augmented with the value prediction for that state multiplied by the discount factor, so the TD error for timeout states becomes `r + \gamma * value - value`.
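To check my understanding, here is a minimal sketch of how I read this timeout handling (this is not the library's actual code; the `compute_gae` signature, tensor names, and shapes are my own assumptions): rewards at timed-out steps are bootstrapped with the discounted value of the state where the episode was truncated, before the usual GAE recursion runs.

```python
import torch

def compute_gae(rewards, dones, timeouts, values, last_values, gamma=0.99, lam=0.95):
    """Sketch of GAE with timeout bootstrapping (hypothetical helper, not rsl_rl's API).

    Assumes float tensors of shape [num_steps, num_envs], with `dones` set to 1
    for both terminations and timeouts, and `timeouts` set to 1 only for timeouts.
    """
    # Bootstrap truncated episodes: add the discounted value of the state at which
    # the episode timed out, so the TD error there becomes r + gamma * V(s) - V(s)
    # instead of r - V(s).
    rewards = rewards + gamma * values * timeouts

    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for step in reversed(range(num_steps)):
        next_values = last_values if step == num_steps - 1 else values[step + 1]
        not_done = 1.0 - dones[step]
        # Standard TD error; the value of the next state is masked out at episode ends.
        delta = rewards[step] + gamma * next_values * not_done - values[step]
        gae = delta + gamma * lam * not_done * gae
        advantages[step] = gae
    return advantages
```

In this reading, a timeout step is still masked like a termination in the recursion, and the bootstrap only enters through the augmented reward, which is exactly the part I would like to understand better.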
Could you please explain intuitively or mathematically the rationale behind the handling of timeouts in the GAE computation?
When designing an environment, should `done` be returned as `True` for both termination and timeout?
Should we interpret `done` and `timeout` as corresponding to the next environment state (i.e. after the physics step) or the current state (before the physics step)?
Hope the above questions make sense, and happy to clarify more!