Handling of timeouts in Generalized Advantage Estimation #43

Open
mohakbhardwaj opened this issue Nov 1, 2024 · 0 comments
Hello,

Thank you for this great library! I have a question about the handling of timeouts when computing Generalized Advantage Estimation, specifically in the following line:

rewards += self.gamma * timeouts * values

If my understanding is correct, when a trajectory ends in a terminal state (i.e., a bad state such as the robot falling), that state is treated as absorbing, so the TD error is simply reward - value. If the episode is instead truncated because it timed out, the agent still needs to reason about the long-term value from the next state. In the line above, the rewards are augmented with the value prediction for that state multiplied by the discount factor, so the TD error for timeout states becomes r + \gamma * value - value.
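
To make my reading concrete, here is a minimal sketch of how I understand the advantage computation. The function and variable names (compute_gae, next_values, etc.) are hypothetical and just for illustration, and I'm assuming flat tensors where dones is also True on timeout steps; this is my interpretation, not the library's actual implementation.

import torch

def compute_gae(rewards, values, next_values, dones, timeouts, gamma=0.99, lam=0.95):
    # rewards, values, next_values, dones, timeouts: tensors of shape (T,)
    # My reading of the quoted line: on timeout steps the reward is first
    # augmented with gamma * V(s), so the TD error there becomes
    # r + gamma * V(s) - V(s) instead of r - V(s).
    rewards = rewards + gamma * timeouts * values
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(rewards.shape[0])):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages

In this sketch the bootstrap on timeouts uses the value of the timed-out state itself rather than the value of the successor state, which is exactly the behavior I'd like to understand better.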

  1. Could you please explain intuitively or mathematically the rationale behind the handling of timeouts in the GAE computation?
  2. When designing an environment, should done be returned as True for both termination and timeout?
  3. Should we interpret done and timeout as corresponding to the next environment state (i.e., after the physics step) or the current state (before the physics step)?

Hope the above questions make sense, and I'm happy to clarify further!
