MDP Version #9

Closed
zsunberg opened this issue Jun 11, 2020 · 23 comments
Comments

@zsunberg

First of all, it's very cool to see AlphaZero implemented in Julia! I have always thought that Julia is a really good tool for this type of thing.

This comment is related to #4, but offers a slightly different perspective. It would be very cool to also have a version of this that works with MDPs, using RLBase as @findmyway suggested in #4, POMDPs.jl (which some colleagues and I work on), or RLInterface.jl. This would need to be a considerably different implementation because the MCTS would approximate an expectimax tree instead of a minimax tree.

I just thought I should start this issue as a stub in case anyone wants to pick it up, to point to some MDP definition interfaces in Julia, and to clarify that the game and MDP versions would need to be different. My students and I would definitely use the package if it had support for MDPs.

@jonathan-laurent (Owner)

Thanks for your interest in AlphaZero.jl.

It is indeed my plan to add support for MDPs. This requires adding three features:

  1. Support for single-player games
  2. Support for intermediate (non-terminal) rewards
  3. Support for stochastic transition functions

Regarding the last point (stochasticity), there are two possible approaches, and I would appreciate some opinions and feedback.

Approach 1: with explicit chance nodes

In this approach, we add explicit chance nodes to MCTS and have the play function return a probability distribution over transitions. During UCT evaluation, the value of a chance node is defined as the weighted average of its children's values.

The advantage of this approach is that it leverages full knowledge of the transition probabilities. On the other hand, it is restricted to finite distributions with relatively small supports (although optimizations such as node grouping can be used to handle larger chance branching factors).
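
To make this concrete, here is a minimal sketch of the chance-node backup I have in mind (the `ChanceNode` type and its fields are hypothetical, not anything in the current codebase):

```julia
# Hypothetical sketch of Approach 1: a chance node stores the transition
# probabilities and the current value estimates of its children, and its own
# value during UCT backup is their probability-weighted average.
struct ChanceNode
    probs::Vector{Float64}         # transition probabilities (should sum to 1)
    child_values::Vector{Float64}  # current value estimates of the children
end

chance_value(n::ChanceNode) = sum(n.probs .* n.child_values)
```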

Approach 2: without explicit chance nodes

The alternative is to have play sample and return a single transition. In this case, we can use the current MCTS implementation without modification. Indeed, the current implementation does not assume that the game is deterministic. Nodes are not identified by the sequence of actions that led to them; rather, a Q-function estimate is maintained for every visited state (therefore implementing full node sharing, see the discussion in #4). When an action is played during tree search, one of the possible child nodes is sampled randomly through play.

This approach can accommodate arbitrary transition distributions. It also has the advantage of treating deterministic games and stochastic games identically, reducing complexity.
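
Roughly, with a hypothetical `play` function that samples a single successor state (not the package's actual API), this looks like:

```julia
# Hypothetical sketch of Approach 2: `play` samples a single successor state,
# and MCTS statistics are keyed by state, so every path reaching the same
# state reuses the same node (full node sharing).
mutable struct NodeStats
    N::Int      # visit count
    W::Float64  # cumulative value
end

nodes = Dict{Any,NodeStats}()

function descend!(nodes, state, action, play)
    next = play(state, action)                   # sample one transition
    node = get!(nodes, next, NodeStats(0, 0.0))  # reuse the node if `next` was seen before
    node.N += 1
    return next, node
end
```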

I am pretty sure both approaches are asymptotically equivalent. However, I have been looking for an empirical comparison and haven't found any. Does anyone have an opinion on this?

@zsunberg (Author)

zsunberg commented Jun 12, 2020

The most common way is more similar to Approach 2: you simply sample a single new state and reward from the transition probability distribution.
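
To be concrete, here is a minimal sketch of what I mean (the function names and the generative model `gen` are hypothetical):

```julia
# Hypothetical sketch of sampling-based MCTS for an MDP: `gen(s, a)` samples
# one (next state, reward) pair, and returns are accumulated with discount γ.
# Action selection and statistics updates are elided; they follow the usual
# UCT/PUCT scheme.
function simulate(gen, select_action, leaf_value, s, depth, γ)
    depth == 0 && return leaf_value(s)  # truncate the rollout with a leaf evaluation
    a = select_action(s)                # e.g. a UCT rule over the children of s
    sp, r = gen(s, a)                   # sample a single transition and reward
    q = r + γ * simulate(gen, select_action, leaf_value, sp, depth - 1, γ)
    # (update N(s, a) and Q(s, a) toward q here)
    return q
end
```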

If you have access to IEEE publications, there is a very clear pseudocode description in algorithm 4.9 of this book chapter: https://ieeexplore-ieee-org.colorado.idm.oclc.org/document/7288642

You can also see one of our implementations here: https://github.com/JuliaPOMDP/MCTS.jl. However, this code was written long ago, so it is not the best to get insight from; there are some things I would do differently now (and plan to in the future).

Happy to review code or help in another way!

@zsunberg (Author)

It will also be important to handle non-terminal rewards for MDPs (I noticed this was mentioned in #4).

@jonathan-laurent (Owner)

Thanks!

If you have access to IEEE publications, there is a very clear pseudocode description in algorithm 4.9 of this book chapter: https://ieeexplore-ieee-org.colorado.idm.oclc.org/document/7288642

The link you gave me points to a "University of Colorado" portal. Can you give me the book title?

@zsunberg (Author)

Oops, haha. It's called "Decision Making under Uncertainty: Theory and Application". This link should work: https://ieeexplore.ieee.org/document/7288642

@jonathan-laurent (Owner)

Thanks for the reference.
The proposed algorithm pretty much corresponds to what is already implemented, except that it supports intermediate rewards (also, AlphaZero uses a slightly different UCT formula).

@JacksonKearl

I’ve been looking for something that can model games with a dice roll at the start of each turn. My impression is that this would require Approach 1 above, since every possible roll should be considered.

How could this be accomplished using Approach 2? It seems to me that it would only be able to sample a single possible roll.

@jonathan-laurent (Owner)

@JacksonKearl Approach 2 would be able to accommodate games with a dice roll.

Yes, during each move of each MCTS simulation, only the result of a single dice roll is explored. But this is OK: each simulation will explore different stochastic choices, and the frequency with which each choice is explored will be proportional to its probability. Even in a deterministic game, MCTS is never able to explore every possible branch (it differs from minimax in this way).

Even when using Approach 1, a single dice roll would be sampled at each chance node. The only difference is that some info from the exact probability distribution could potentially be used when backpropagating rewards.
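
Here is a toy illustration of why sampling inside the simulations is enough: each outcome is explored with frequency proportional to its probability, so the Monte Carlo average converges to the same expectation an explicit chance node would compute (the leaf values here are made up):

```julia
using Statistics

# Made-up leaf values: suppose only a roll of 6 eventually leads to a win.
value_after_roll(roll) = roll == 6 ? 1.0 : 0.0

# Approach 2 style: sample the roll in every simulation and average.
sampled = mean(value_after_roll(rand(1:6)) for _ in 1:100_000)

# Approach 1 style: weight each outcome by its exact probability.
exact = sum(value_after_roll(r) / 6 for r in 1:6)

println((sampled, exact))  # both are ≈ 1/6
```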

@jonathan-laurent (Owner)

An update: the last release (v0.3.0) generalized the game interface. In particular, it adds support for stochastic games and intermediate rewards. It shouldn't take much work to add support for CommonRLInterface.jl now.

@jonathan-laurent (Owner)

@zsunberg Do you have a nice MDP environment to suggest that I could use as a test example for MDP support in AlphaZero.jl?

@JacksonKearl

@jonathan-laurent are these updates in the docs? I don't see anything here: https://jonathan-laurent.github.io/AlphaZero.jl/stable/reference/game_interface/

@jonathan-laurent (Owner)

@JacksonKearl The doc seems up to date to me (you can compare it with the doc for v0.2.0). In particular, at the top of the page you link to, one can read that "stochastic games and intermediate rewards are supported". Note however that:

  • MDP support is not there yet (it would require implementing support for single-player games)
  • I have decided to wait a bit before advertising these features more heavily as they haven't been thoroughly tested yet.

@zsunberg (Author)

@jonathan-laurent Awesome. I would say start simple with a grid world problem (you can google it to see what it is); that way you can easily compare against the optimal solution from value iteration. We have one implemented in the POMDPs.jl interface here: https://github.com/JuliaPOMDP/POMDPModels.jl/blob/master/src/gridworld.jl I have been a bit busy the last few days, but I hope to do some more work on CommonRLInterface this evening and tomorrow, so I will try to implement a grid world problem in that interface (or just provide POMDPs <-> CommonRLInterface compatibility) soon.
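
For reference, a grid world in the CommonRLInterface.jl style would look roughly like this (the grid size, rewards, and state encoding here are invented for illustration, not the POMDPModels.jl implementation linked above):

```julia
# Rough sketch of a minimal grid world against CommonRLInterface.jl.
using CommonRLInterface
const RL = CommonRLInterface

mutable struct GridWorld <: AbstractEnv
    pos::Tuple{Int,Int}
end
GridWorld() = GridWorld((1, 1))

RL.reset!(env::GridWorld) = (env.pos = (1, 1); nothing)
RL.actions(::GridWorld) = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, up, down
RL.observe(env::GridWorld) = env.pos
RL.terminated(env::GridWorld) = env.pos == (5, 5)             # goal in the far corner

function RL.act!(env::GridWorld, a)
    env.pos = clamp.(env.pos .+ a, 1, 5)     # moves that leave the 5x5 grid are clipped
    return env.pos == (5, 5) ? 1.0 : -0.04   # small step cost, reward at the goal
end
```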

@jonathan-laurent (Owner)

@zsunberg I think grid world is a good suggestion. However, there should not be a unique, fixed world; otherwise there is no opportunity for generalization and AlphaZero is useless. I think a nice setting would be to generate a random world at the beginning of each episode and to let the agent observe the whole grid.

Thanks for your work on CommonRLInterface.jl. I think it is a really valuable project.

@zsunberg (Author)

zsunberg commented Jul 1, 2020

However, there should not be a unique, fixed world,...

Yes, I agree that a fixed grid world is of course not a good showcase for AlphaZero :) but I think it is an important first step so that we can verify that our semantics are the same and that it solves the simplest problems correctly.

@zsunberg (Author)

@jonathan-laurent I added a grid world example to the CommonRLInterface.jl package: https://github.com/JuliaReinforcementLearning/CommonRLInterface.jl/tree/master/examples Do you think that will work?

@jonathan-laurent (Owner)

Thanks @zsunberg. I won't try it straight away, as my priority right now is to add support for distributed computing, but it is second on my TODO list.

@zsunberg (Author)

@jonathan-laurent, cool. By the way, one thing I have heard a bunch is that when implementing parallel MCTS, using locks usually doesn't work that well. Instead, it is better to use atomic operations that can simply fail if there is a conflict between two threads. Luckily, Julia has these: https://docs.julialang.org/en/v1/base/multi-threading/#Base.Threads.Atomic
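
For example, hypothetical per-node statistics could be updated without locks like this:

```julia
using Base.Threads  # Atomic, atomic_add!, @threads

# Hypothetical per-node MCTS statistics updated without locks: conflicting
# updates from different threads are resolved atomically.
struct AtomicStats
    visits::Atomic{Int}
    total_value::Atomic{Float64}
end
AtomicStats() = AtomicStats(Atomic{Int}(0), Atomic{Float64}(0.0))

node = AtomicStats()

@threads for i in 1:10_000
    atomic_add!(node.visits, 1)          # atomic increment, no lock
    atomic_add!(node.total_value, 0.5)   # atomic_add! also supports floats
end

println((node.visits[], node.total_value[]))  # (10000, 5000.0)
```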

@jonathan-laurent (Owner)

Thanks for the advice @zsunberg! For a start, I decided not to implement parallel MCTS and rather parallelize across game simulations: all MCTS instances are independent and they send requests to a common inference server via a channel. But I'll definitely look into this when I add parallel MCTS into the mix.
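
Roughly, the pattern looks like this (the names and the fake evaluation are illustrative, not the actual implementation):

```julia
# Rough sketch: independent simulation workers send inference requests over a
# shared Channel; a single server task answers each request on its own reply
# channel. The "evaluation" is a stand-in for the (batched) network call.
struct Request
    state::Vector{Float32}
    reply::Channel{Float32}
end

requests = Channel{Request}(64)

server = @async for req in requests
    put!(req.reply, sum(req.state))  # stand-in for a neural-network evaluation
end

function evaluate(requests, state)
    reply = Channel{Float32}(1)
    put!(requests, Request(state, reply))
    return take!(reply)
end

@show evaluate(requests, Float32[1, 2, 3])  # 6.0
close(requests)
```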

@zsunberg (Author)

Ah, yes, that is probably more immediately useful from an RL perspective.

@zsunberg (Author)

zsunberg commented Jan 2, 2021

For anyone reading this thread, this is useful: #4 (comment)

@jonathan-laurent (Owner)

@zsunberg Support for CommonRLInterface.jl is finally merged on master.
I also added your grid world example, which can be solved by running:

julia --project -e 'using AlphaZero; Scripts.train("grid-world")'

@zsunberg (Author)

Nice, that's great to hear! Looking forward to trying it out, and I will recommend it to the students in my class for their projects.
