Training on the cloud / multiple instances / clusters #50

SheldonCurtiss · 2021-07-23T13:02:01Z

Any tips for running this on Azure without paying JuliaHub's insane premium?
I'm trying to leverage spot pricing, which is about 1/10th to 1/20th the cost of JuliaHub's pricing.

I found this:
https://github.com/microsoft/AzureClusterlessHPC.jl

I'm not entirely sure how JuliaHub handles running this code on multiple machines together. Is there a command to connect multiple instances together, or something built in similar to Ray? Or will it be an incredibly painful process of setting up the code for use with the GitHub package I linked above?

jonathan-laurent · 2021-07-23T16:03:03Z

Thanks for your interest in AlphaZero.jl!
I have never used AlphaZero.jl on Azure and I'm not especially familiar with Azure either.

AlphaZero.jl itself does not deal with any kind of cluster setup. It just gets a list of available workers using the Distributed module and splits the work equally between them. What's nice with JuliaHub is that it takes care of the details of configuring a cluster and spawning remote processes, but I guess it should not be hard to configure the system to work on your own cluster: see the documentation.
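
As a rough sketch of what "getting a list of workers and splitting the work" can look like with the standard Distributed module: this is not AlphaZero.jl's actual code. `play_games` is a hypothetical stand-in for a batch of self-play games, and the SSH host in the comment is made up.

```julia
using Distributed

# Start two local worker processes so that the sketch runs on one machine.
# On your own cluster you would instead pass SSH-reachable machines, e.g.
#   addprocs([("user@10.0.0.5", 4)])  # hypothetical host running 4 workers
addprocs(2)
@everywhere using Distributed

# Hypothetical stand-in for a batch of self-play games, defined on all processes.
@everywhere play_games(n) = (worker = myid(), games = n)

total_games = 100
per_worker = total_games ÷ nworkers()  # equal share for each worker

# Submit one batch per worker; pmap hands batches to whichever workers are free.
results = pmap(play_games, fill(per_worker, nworkers()))
println("workers: ", workers(), ", games played: ", sum(r.games for r in results))
```

Any code that only relies on `workers()` in this way should not care whether the processes were started locally, over SSH, or by a cluster manager.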

I am not familiar with the package you linked to, but it looks like a replacement for Distributed, so it may not be what you want here. If you want more general advice on running Julia code that relies on Distributed.jl on Azure (as is the case for AlphaZero.jl), I would advise you to ask on Discourse or on the Julia Slack. :-)

SheldonCurtiss · 2021-07-23T17:34:37Z

Sweet! This looks great. Also, while I have you, can I ask two super quick questions?
I'm using AlphaZero.GameInterface.init to initialize my game in a random way. Will that pose any issues for replays?
I'm also making a two-player game in which each player can make the same moves and can have an inventory, but I'm not having them do anything on the board. Could that somehow break replays?

Sorry, I'm really new to reinforcement learning and Julia, kinda working this out as I go.

findmyway · 2021-07-23T17:46:19Z

> I found this:
> https://github.com/microsoft/AzureClusterlessHPC.jl
>
> I'm not entirely sure how JuliaHub handles running this code on multiple machines together. Is there a command to connect multiple instances together, or something built in similar to Ray? Or will it be an incredibly painful process of setting up the code for use with the GitHub package I linked above?

Programming in AzureClusterlessHPC.jl is quite different from Distributed.jl. You need to write some extra code to make AlphaZero.jl work in it. I'd suggest you use AKS instead. With K8sClusterManagers.jl, AlphaZero.jl should work out of the box.

jonathan-laurent · 2021-07-23T17:54:19Z

> I'm using AlphaZero.GameInterface.init to initialize my game in a random way. Will that pose any issues for replays?

Having init initialize the game randomly should work and in fact the grid-world example does this.
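
A minimal sketch of such a random initializer, assuming a hypothetical grid world whose state is just the agent's position. `random_initial_state` and its `size` keyword are made-up names, not the code of the grid-world example:

```julia
using Random

# Hypothetical initializer: draw a random starting cell on a size-by-size grid.
# Passing the RNG explicitly makes it possible to reproduce a given start.
function random_initial_state(rng::AbstractRNG = Random.default_rng(); size::Int = 5)
    return (row = rand(rng, 1:size), col = rand(rng, 1:size))
end

s = random_initial_state(MersenneTwister(42))
println(s)
```

Seeding the generator, as in the last lines, gives the same starting cell on every run, which can help when debugging.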

> I'm also making a two-player game in which each player can make the same moves and can have an inventory, but I'm not having them do anything on the board. Could that somehow break replays?

I am not sure I understand the question here. What are you calling "board"? In your case, if both players have inventories, these inventories should be part of the state.
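
To illustrate, here is a sketch of a state that carries the inventories alongside the board and the current player. The type, field, and function names are hypothetical and chosen for this example; they are not taken from AlphaZero.GameInterface:

```julia
# Hypothetical game environment; all names here are illustrative.
mutable struct GameEnv
    board::Vector{Int}                # flattened board cells
    inventories::Vector{Vector{Int}}  # inventories[p] holds player p's item counts
    curplayer::Int                    # 1 or 2
end

# The state captures everything needed to resume the game, inventories included.
# Copies prevent later mutations of the environment from changing the snapshot.
current_state(g::GameEnv) =
    (board = copy(g.board), inventories = map(copy, g.inventories), curplayer = g.curplayer)

# Restore an environment from a previously captured state.
function set_state!(g::GameEnv, s)
    g.board = copy(s.board)
    g.inventories = map(copy, s.inventories)
    g.curplayer = s.curplayer
    return g
end

env = GameEnv(zeros(Int, 9), [[0, 0], [0, 0]], 1)
saved = current_state(env)
env.inventories[2][1] += 3   # player 2 picks up three items
env.curplayer = 2
set_state!(env, saved)       # back to the saved position
```

If the inventories were left out of the state, restoring `saved` would bring back the board and the player to move but not what each player was holding.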

SheldonCurtiss · 2021-07-23T18:27:08Z

> I am not sure I understand the question here. What are you calling "board"? In your case, if both players have inventories, these inventories should be part of the state.

Sorry, going off the examples, the state is the board and the player.
That answers my question though, I'll do it that way.

SheldonCurtiss · 2021-07-23T18:31:15Z

> > I found this:
> > https://github.com/microsoft/AzureClusterlessHPC.jl
> > I'm not entirely sure how JuliaHub handles running this code on multiple machines together. Is there a command to connect multiple instances together, or something built in similar to Ray? Or will it be an incredibly painful process of setting up the code for use with the GitHub package I linked above?
>
> Programming in AzureClusterlessHPC.jl is quite different from Distributed.jl. You need to write some extra code to make AlphaZero.jl work in it. I'd suggest you use AKS instead. With K8sClusterManagers.jl, AlphaZero.jl should work out of the box.

Awesome awesome thank you so much!

SheldonCurtiss · 2021-07-23T18:45:51Z

> Having init initialize the game randomly should work and in fact the grid-world example does this.

Speaking of the grid-world example, I steered away from it because it uses CommonRLInterface as opposed to AlphaZero.GI, so I wasn't entirely sure about it, and it works incredibly differently from the other examples.

jonathan-laurent · 2021-07-23T19:01:33Z

> Speaking of the grid-world example, I steered away from it because it uses CommonRLInterface as opposed to AlphaZero.GI, so I wasn't entirely sure about it, and it works incredibly differently from the other examples.

I agree that this example looks pretty different on the surface but remember that AlphaZero.jl only provides a thin wrapper over CommonRLInterface.jl. Therefore, it should not be too hard to translate the example so that it uses AlphaZero.GameInterface.

Good luck using AlphaZero on your game and please don't hesitate to report back about your results or experience!

SheldonCurtiss changed the title from "Running this on Azure" to "Training on the cloud / multiple instances / clusters" on Jul 23, 2021

jonathan-laurent closed this as completed Jul 23, 2021
