Thank you for this great library! I have a question about the handling of timeouts when computing Generalized Advantage Estimation (GAE), specifically in the following line:

`rsl_rl/rsl_rl/algorithms/ppo.py`, line 346 (commit `96393c4`)
If my understanding is correct, when a trajectory ends in a terminal state (i.e. a failure state such as the robot falling), that state is treated as absorbing, and the TD error is simply `reward - value`. If the trajectory is instead truncated because the episode timed out, the agent still needs to reason about the long-term value from the next state. In the line above, however, the rewards are simply augmented with the value prediction for that state multiplied by the discount factor, so the TD error for timeout states becomes `r + \gamma * value - value`.
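To check my understanding, here is a minimal sketch of how I read this timeout handling (this is not the library's actual code; the `compute_gae` signature, tensor names, and shapes are my own assumptions): rewards at timed-out steps are bootstrapped with the discounted value of the state where the episode was truncated, before the usual GAE recursion runs.

```python
import torch

def compute_gae(rewards, dones, timeouts, values, last_values, gamma=0.99, lam=0.95):
    """Sketch of GAE with timeout bootstrapping (hypothetical helper, not rsl_rl's API).

    Assumes float tensors of shape [num_steps, num_envs], with `dones` set to 1
    for both terminations and timeouts, and `timeouts` set to 1 only for timeouts.
    """
    # Bootstrap truncated episodes: add the discounted value of the state at which
    # the episode timed out, so the TD error there becomes r + gamma * V(s) - V(s)
    # instead of r - V(s).
    rewards = rewards + gamma * values * timeouts

    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for step in reversed(range(num_steps)):
        next_values = last_values if step == num_steps - 1 else values[step + 1]
        not_done = 1.0 - dones[step]
        # Standard TD error; the value of the next state is masked out at episode ends.
        delta = rewards[step] + gamma * next_values * not_done - values[step]
        gae = delta + gamma * lam * not_done * gae
        advantages[step] = gae
    return advantages
```

In this reading, a timeout step is still masked like a termination in the recursion, and the bootstrap only enters through the augmented reward, which is exactly the part I would like to understand better.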
Could you please explain intuitively or mathematically the rationale behind the handling of timeouts in the GAE computation?
When designing an environment, should `done` be returned as `True` for both termination and timeout?
Should we interpret `done` and `timeout` as corresponding to the next environment state (i.e. after the physics step) or the current state (before the physics step)?
Hope the above questions make sense, and happy to clarify more!