MDP Version #9

Closed
zsunberg opened this issue Jun 11, 2020 · 23 comments
Comments

@zsunberg

First of all, it's very cool to see AlphaZero implemented in Julia! I have always thought that Julia is a really good tool for this type of thing.

This comment is related to #4, but offers a slightly different perspective. It would be very cool to also have a version of this that works with MDPs, using RLBase as @findmyway suggested in #4, POMDPs.jl (which some colleagues and I work on), or RLInterface.jl. This would need to be a considerably different implementation because the MCTS would approximate an expectimax tree instead of a minimax tree.

I just thought I should start this issue as a stub in case anyone wants to pick it up, to point to some MDP definition interfaces in Julia, and to clarify that the game and MDP versions would need to be different. My students and I would definitely use the package if it had support for MDPs.

@jonathan-laurent (Owner)

Thanks for your interest in AlphaZero.jl.

It is indeed my plan to add support for MDPs. This requires adding three features:

  1. Support for single-player games
  2. Support for intermediate (non-terminal) rewards
  3. Support for stochastic transition functions

Regarding the last point (stochasticity), there are two possible approaches, and I would appreciate some opinions and feedback.

Approach 1: with explicit chance nodes

In this approach, we add explicit chance nodes to MCTS and have the play function return a probability distribution over transitions. During UCT evaluation, the value of a chance node is defined as the weighted average of its children's values.

The advantage of this approach is that it leverages full knowledge of the transition probabilities. On the other hand, it is restricted to finite distributions with relatively small supports (although optimizations such as node grouping can be used to handle larger chance branching factors).
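
To make this concrete, here is a minimal sketch of the chance-node backup I have in mind (the `ChanceNode` type and its fields are hypothetical, not anything in the current codebase):

```julia
# Hypothetical sketch of Approach 1: a chance node stores the transition
# probabilities and the current value estimates of its children, and its own
# value during UCT backup is their probability-weighted average.
struct ChanceNode
    probs::Vector{Float64}         # transition probabilities (should sum to 1)
    child_values::Vector{Float64}  # current value estimates of the children
end

chance_value(n::ChanceNode) = sum(n.probs .* n.child_values)
```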

Approach 2: without explicit chance nodes

The alternative is to have play sample and return a single transition. In this case, we can use the current MCTS implementation without modification. Indeed, the current implementation does not assume that the game is deterministic. Nodes are not identified by the sequence of actions that led to them; rather, a Q-function estimate is maintained for every visited state (therefore implementing full node sharing, see the discussion in #4). When an action is played during tree search, one of the possible child nodes is sampled randomly through play.

This approach can accommodate arbitrary transition distributions. It also has the advantage of treating deterministic games and stochastic games identically, reducing complexity.
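
Roughly, with a hypothetical `play` function that samples a single successor state (not the package's actual API), this looks like:

```julia
# Hypothetical sketch of Approach 2: `play` samples a single successor state,
# and MCTS statistics are keyed by state, so every path reaching the same
# state reuses the same node (full node sharing).
mutable struct NodeStats
    N::Int      # visit count
    W::Float64  # cumulative value
end

nodes = Dict{Any,NodeStats}()

function descend!(nodes, state, action, play)
    next = play(state, action)                   # sample one transition
    node = get!(nodes, next, NodeStats(0, 0.0))  # reuse the node if `next` was seen before
    node.N += 1
    return next, node
end
```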

I am pretty sure both approaches are asymptotically equivalent. However, I have been looking for an empirical comparison and haven't found any. Does anyone have an opinion on this?

@zsunberg (Author)

zsunberg commented Jun 12, 2020

The most common way is more similar to Approach 2: you simply sample a single new state and reward from the transition probability distribution.
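
To be concrete, here is a minimal sketch of what I mean (the function names and the generative model `gen` are hypothetical):

```julia
# Hypothetical sketch of sampling-based MCTS for an MDP: `gen(s, a)` samples
# one (next state, reward) pair, and returns are accumulated with discount γ.
# Action selection and statistics updates are elided; they follow the usual
# UCT/PUCT scheme.
function simulate(gen, select_action, leaf_value, s, depth, γ)
    depth == 0 && return leaf_value(s)  # truncate the rollout with a leaf evaluation
    a = select_action(s)                # e.g. a UCT rule over the children of s
    sp, r = gen(s, a)                   # sample a single transition and reward
    q = r + γ * simulate(gen, select_action, leaf_value, sp, depth - 1, γ)
    # (update N(s, a) and Q(s, a) toward q here)
    return q
end
```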

If you have access to IEEE publications, there is a very clear pseudocode description in algorithm 4.9 of this book chapter: https://ieeexplore-ieee-org.colorado.idm.oclc.org/document/7288642

You can also see one of our implementations here: https://github.com/JuliaPOMDP/MCTS.jl. However, this code was written long ago, so it is not the best to get insight from; there are some things I would do differently now (and plan to in the future).

Happy to review code or help in another way!

@zsunberg (Author)

It will also be important to handle non-terminal rewards for MDPs (I noticed this was mentioned in #4).

@jonathan-laurent (Owner)

Thanks!

If you have access to IEEE publications, there is a very clear pseudocode description in algorithm 4.9 of this book chapter: https://ieeexplore-ieee-org.colorado.idm.oclc.org/document/7288642

The link you gave me points to a "University of Colorado" portal. Can you give me the book title?

@zsunberg (Author)

Oops, haha. It's called "Decision Making under Uncertainty: Theory and Application". This link should work: https://ieeexplore.ieee.org/document/7288642

@jonathan-laurent (Owner)

Thanks for the reference.
The proposed algorithm pretty much corresponds to what is already implemented, except that it supports intermediate rewards (also, AlphaZero uses a slightly different UCT formula).

@JacksonKearl

I’ve been looking for something that can model games with a dice roll at the start of each turn. My impression is that this would require Approach 1 above, since every possible roll should be considered.

How could this be accomplished using Approach 2? It seems to me that it would only be able to sample a single possible roll.

@jonathan-laurent (Owner)

@JacksonKearl Approach 2 would be able to accommodate games with a dice roll.

Yes, during each move of each MCTS simulation, only the result of a single dice roll is explored. But this is OK: each simulation will explore different stochastic choices, and the frequency with which each choice is explored will be proportional to its probability. Even in a deterministic game, MCTS is never able to explore every possible branch (it differs from minimax in this way).

Even when using Approach 1, a single dice roll would be sampled at each chance node. The only difference is that some info from the exact probability distribution could potentially be used when backpropagating rewards.
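
Here is a toy illustration of why sampling inside the simulations is enough: each outcome is explored with frequency proportional to its probability, so the Monte Carlo average converges to the same expectation an explicit chance node would compute (the leaf values here are made up):

```julia
using Statistics

# Made-up leaf values: suppose only a roll of 6 eventually leads to a win.
value_after_roll(roll) = roll == 6 ? 1.0 : 0.0

# Approach 2 style: sample the roll in every simulation and average.
sampled = mean(value_after_roll(rand(1:6)) for _ in 1:100_000)

# Approach 1 style: weight each outcome by its exact probability.
exact = sum(value_after_roll(r) / 6 for r in 1:6)

println((sampled, exact))  # both are ≈ 1/6
```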

@jonathan-laurent (Owner)

An update: the last release (v0.3.0) generalized the game interface. In particular, it adds support for stochastic games and intermediate rewards. It shouldn't take much work to add support for CommonRLInterface.jl now.

@jonathan-laurent (Owner)

@zsunberg Do you have a nice MDP environment to suggest that I could use as a test example for MDP support in AlphaZero.jl?

@JacksonKearl

@jonathan-laurent are these updates in the docs? I don't see anything here: https://jonathan-laurent.github.io/AlphaZero.jl/stable/reference/game_interface/

@jonathan-laurent (Owner)

@JacksonKearl The doc seems up to date to me (you can compare it with the doc for v0.2.0). In particular, at the top of the page you link to, one can read that "stochastic games and intermediate rewards are supported". Note however that:

  • MDP support is not there yet (it would require implementing support for single-player games)
  • I have decided to wait a bit before advertising these features more heavily as they haven't been thoroughly tested yet.

@zsunberg (Author)

@jonathan-laurent Awesome. I would say start simple with a grid world problem (you can google it to see what it is); that way you can easily compare against the optimal solution from value iteration. We have one implemented in the POMDPs.jl interface here: https://github.com/JuliaPOMDP/POMDPModels.jl/blob/master/src/gridworld.jl I have been a bit busy the last few days, but I hope to do some more work on CommonRLInterface this evening and tomorrow, so I will try to implement a grid world problem in that interface (or just provide POMDPs <-> CommonRLInterface compatibility) soon.
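
For reference, a grid world in the CommonRLInterface.jl style would look roughly like this (the grid size, rewards, and state encoding here are invented for illustration, not the POMDPModels.jl implementation linked above):

```julia
# Rough sketch of a minimal grid world against CommonRLInterface.jl.
using CommonRLInterface
const RL = CommonRLInterface

mutable struct GridWorld <: AbstractEnv
    pos::Tuple{Int,Int}
end
GridWorld() = GridWorld((1, 1))

RL.reset!(env::GridWorld) = (env.pos = (1, 1); nothing)
RL.actions(::GridWorld) = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, up, down
RL.observe(env::GridWorld) = env.pos
RL.terminated(env::GridWorld) = env.pos == (5, 5)             # goal in the far corner

function RL.act!(env::GridWorld, a)
    env.pos = clamp.(env.pos .+ a, 1, 5)     # moves that leave the 5x5 grid are clipped
    return env.pos == (5, 5) ? 1.0 : -0.04   # small step cost, reward at the goal
end
```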

@jonathan-laurent (Owner)

@zsunberg I think grid world is a good suggestion. However, there should not be a unique, fixed world; otherwise there is no opportunity for generalization and AlphaZero is useless. I think a nice setting would be to generate a random world at the beginning of each episode and to let the agent observe the whole grid.

Thanks for your work on CommonRLInterface.jl. I think it is a really valuable project.

@zsunberg (Author)

zsunberg commented Jul 1, 2020

However, there should not be a unique, fixed world,...

Yes, I agree that a fixed grid world is of course not a good showcase for AlphaZero :) but I think it is an important first step so that we can verify that our semantics are the same and that it solves the simplest problems correctly.

@zsunberg (Author)

@jonathan-laurent I added a grid world example to the CommonRLInterface.jl package: https://github.com/JuliaReinforcementLearning/CommonRLInterface.jl/tree/master/examples Do you think that will work?

@jonathan-laurent (Owner)

Thanks @zsunberg. I won't try it straight away, as my priority right now is to add support for distributed computing, but it is second on my TODO list.

@zsunberg (Author)

@jonathan-laurent, cool. By the way, one thing I have heard a bunch is that when implementing parallel MCTS, using locks usually doesn't work that well. Instead, it is better to use atomic operations that can simply fail if there is a conflict between two threads. Luckily, Julia has these: https://docs.julialang.org/en/v1/base/multi-threading/#Base.Threads.Atomic
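
For example, hypothetical per-node statistics could be updated without locks like this:

```julia
using Base.Threads  # Atomic, atomic_add!, @threads

# Hypothetical per-node MCTS statistics updated without locks: conflicting
# updates from different threads are resolved atomically.
struct AtomicStats
    visits::Atomic{Int}
    total_value::Atomic{Float64}
end
AtomicStats() = AtomicStats(Atomic{Int}(0), Atomic{Float64}(0.0))

node = AtomicStats()

@threads for i in 1:10_000
    atomic_add!(node.visits, 1)          # atomic increment, no lock
    atomic_add!(node.total_value, 0.5)   # atomic_add! also supports floats
end

println((node.visits[], node.total_value[]))  # (10000, 5000.0)
```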

@jonathan-laurent (Owner)

Thanks for the advice @zsunberg! For a start, I decided not to implement parallel MCTS and rather parallelize across game simulations: all MCTS instances are independent and they send requests to a common inference server via a channel. But I'll definitely look into this when I add parallel MCTS into the mix.
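
Roughly, the pattern looks like this (the names and the fake evaluation are illustrative, not the actual implementation):

```julia
# Rough sketch: independent simulation workers send inference requests over a
# shared Channel; a single server task answers each request on its own reply
# channel. The "evaluation" is a stand-in for the (batched) network call.
struct Request
    state::Vector{Float32}
    reply::Channel{Float32}
end

requests = Channel{Request}(64)

server = @async for req in requests
    put!(req.reply, sum(req.state))  # stand-in for a neural-network evaluation
end

function evaluate(requests, state)
    reply = Channel{Float32}(1)
    put!(requests, Request(state, reply))
    return take!(reply)
end

@show evaluate(requests, Float32[1, 2, 3])  # 6.0
close(requests)
```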

@zsunberg (Author)

Ah, yes, that is probably more immediately useful from an RL perspective.

@zsunberg (Author)

zsunberg commented Jan 2, 2021

For anyone reading this thread, this is useful: #4 (comment)

@jonathan-laurent (Owner)

@zsunberg Support for CommonRLInterface.jl is finally merged on master.
I also added your grid world example, which can be solved by running:

julia --project -e 'using AlphaZero; Scripts.train("grid-world")'

@zsunberg (Author)

Nice, that's great to hear! Looking forward to trying it out, and I will recommend it to the students in my class for their projects.
