
self play takes more and more time #41

Closed · magicly opened this issue Apr 21, 2021 · 19 comments

magicly commented Apr 21, 2021

Hi Jonathan!

I've been trying to tune AlphaZero.jl hyperparameters recently and ran into a problem. With master (commit 91bb698) and nothing changed, self-play takes more and more time with each iteration:

iter 1: 49m   (GPU 33%, CPU 300%)
iter 2: 2h2m  (GPU 15%, CPU 330%)
iter 3: 7h30m (GPU 4%, CPU 230%)

There are 54 GB of free memory, so this is very strange.

Below is my system info:

CPU: Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz, 14 physical cores / 28 threads
Memory: 64 GB
GPU: RTX 2080 Ti (NVIDIA-SMI 450.102.04, Driver Version 450.102.04, CUDA Version 11.0)
OS: Ubuntu 18.04

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)

julia> Threads.nthreads()
28

I think it would be fine if either the CPU or the GPU were fully utilized, but no matter how I change the parameters, I just can't get there. Even worse, iteration 2 uses less GPU than iteration 1, and iteration 3 even less.

jonathan-laurent (Owner) commented:

First of all, thanks for filing this issue. This is reminiscent of an issue I've had from the very start of the project, although the effects here look much more dramatic.

It is also the case on my computer that performance tends to decrease after each iteration, in a way that is fixed by restarting Julia periodically. This decrease in performance seems to be caused by an increasing amount of time spent in GC (if confirmed, this would explain why you see a decrease in CPU utilization over time, since GC collections block all threads as far as I know).
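(As a quick, independent check of the GC hypothesis, here is a minimal sketch using Base's @timed macro, which reports the time spent in GC for whatever workload you wrap; the workload below is just a placeholder.)

# Minimal sketch: measure the fraction of time spent in GC for an arbitrary workload.
# Replace the placeholder with one self-play iteration or any representative chunk of work.
stats = @timed begin
    sum(rand(10^7))   # hypothetical stand-in for a self-play iteration
end
println("GC fraction: ", round(100 * stats.gctime / stats.time; digits=1), "%")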

This effect used to be dramatic (see JuliaGPU/CUDA.jl#137) and I remember AlphaZero.jl spending 90% of its time within the GC after one or two iterations. Recently, things have been looking much better on my computer (~20% performance loss over the whole connect-four training), but the effect is still there.

To confirm what is happening in your case, could you share the performance plots that are generated after each iteration? They should appear automatically at sessions/.../iter_perfs.png and would let us confirm whether the ratio of time spent in GC increases at each iteration.

Also, could you run the experiment using different CUDA memory pools by setting JULIA_CUDA_MEMORY_POOL to either "cuda", "split" or "binned" before launching the program?
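For instance (a minimal sketch; the pool has to be selected before CUDA.jl initializes, so the variable goes at the very top of the launch script):

# Select the CUDA memory pool before CUDA.jl is loaded and initialized.
ENV["JULIA_CUDA_MEMORY_POOL"] = "split"   # or "cuda" / "binned"

using AlphaZero   # loads CUDA.jl, which picks up the setting above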


Admittedly, I am still at a loss regarding the source of these performance regressions. A natural explanation would be a memory leak in AlphaZero.jl, but I don't see how this could happen as it shares very little state across training iterations. Hopefully, your reports will get us closer to the truth, as the effects you are observing are so dramatic.


magicly commented Apr 22, 2021

Thanks for your reply. I also suspect there is a CUDA memory leak. I'll do more experiments and post iter_perfs.png when I'm ready.

PS: CUDA.jl v3 seems to bring big improvements; I can't wait for it.


magicly commented Apr 22, 2021

GC takes 60%~90% of self-play time when using the default JULIA_CUDA_MEMORY_POOL, and only about 15% when setting:

ENV["JULIA_CUDA_MEMORY_POOL"] = "split" 

[attached: screenshots of the self-play performance plots]

Maybe we should make "split" the default.

Besides, the docs at https://jonathan-laurent.github.io/AlphaZero.jl/dev/tutorial/connect_four/ describe two ways to start training connect-four, but there are still both bin/main.jl and scripts/alphazero.jl. This is a little confusing; maybe we could clarify it in the docs.

jonathan-laurent (Owner) commented:

I updated the Manifest on #master so that it uses CUDA 3.0.
CUDA 3.0 uses the cuda asynchronous memory pool by default, which should be even faster than the splitting pool.
Would you mind running another test with this configuration and sharing the perf graphs for each iteration, so that we can measure how performance degrades over time?

> Besides, the docs at https://jonathan-laurent.github.io/AlphaZero.jl/dev/tutorial/connect_four/ describe two ways to start training connect-four, but there are still both bin/main.jl and scripts/alphazero.jl. This is a little confusing; maybe we could clarify it in the docs.

I should just remove the bin directory. It was created for the needs of the JuliaHub demo, as JuliaHub required a bin directory to exist at the time. This also reminds me that I should add a tutorial on running AlphaZero.jl on JuliaHub.


magicly commented Apr 23, 2021

I pulled the master code and instantiated the dependencies, but nothing seems to have changed. When I don't modify alphazero.jl, the ETA is still 1h30m at 50% progress and the GPU is at about 8%, so I interrupted it. When I comment out line 6 in alphazero.jl (which means setting ENV["JULIA_CUDA_MEMORY_POOL"] = "split"), the ETA is still 42m at 30% progress and the GPU is at about 30%, which I think is a little slower than the previous split-mode result.

Do I need to set ENV["JULIA_CUDA_MEMORY_POOL"] to some value?

Besides, I noticed that you just added NNlibCUDA to the deps while Flux stays the same. Is that OK?


magicly commented Apr 23, 2021

These are the iteration 1 and 2 perf plots when setting ENV["JULIA_CUDA_MEMORY_POOL"] = "split":

[attached: performance plots for iterations 1 and 2]

An error happens during self-play in iteration 3:

Starting iteration 3

======self play starting
  Starting self-play

CUDNNError: CUDNN_STATUS_EXECUTION_FAILED (code 8)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
    @ CUDA.CUDNN ~/.julia/packages/CUDA/Px7QU/lib/cudnn/error.jl:22
  [2] macro expansion
    @ ~/.julia/packages/CUDA/Px7QU/lib/cudnn/error.jl:39 [inlined]
  [3] cudnnActivationForward(handle::Ptr{Nothing}, activationDesc::CUDA.CUDNN.cudnnActivationDescriptor, alpha::Base.RefValue{Float32}, xDesc::CUDA.CUDNN.cudnnTensorDescriptor, x::CUDA.CuArray{Float32, 4}, beta::Base.RefValue{Float32}, yDesc::CUDA.CUDNN.cudnnTensorDescriptor, y::CUDA.CuArray{Float32, 4})
    @ CUDA.CUDNN ~/.julia/packages/CUDA/Px7QU/lib/utils/call.jl:26
  [4] #cudnnActivationForwardAD#645
    @ ~/.julia/packages/CUDA/Px7QU/lib/cudnn/activation.jl:48 [inlined]
  [5] #cudnnActivationForwardWithDefaults#644
    @ ~/.julia/packages/CUDA/Px7QU/lib/cudnn/activation.jl:42 [inlined]
  [6] #cudnnActivationForward!#641
    @ ~/.julia/packages/CUDA/Px7QU/lib/cudnn/activation.jl:22 [inlined]
  [7] #46
    @ ~/.julia/packages/NNlibCUDA/ESR3l/src/cudnn/activations.jl:13 [inlined]
  [8] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(NNlib.relu), Tuple{CUDA.CuArray{Float32, 4}}})
    @ NNlibCUDA ~/.julia/packages/NNlibCUDA/ESR3l/src/cudnn/activations.jl:30
  [9] (::Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}})(x::CUDA.CuArray{Float32, 4}, cache::Nothing)
    @ Flux.CUDAint ~/.julia/packages/Flux/Lffio/src/cuda/cudnn.jl:9
 [10] BatchNorm
    @ ~/.julia/packages/Flux/Lffio/src/cuda/cudnn.jl:6 [inlined]
 [11] applychain(fs::Tuple{Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}}, x::CUDA.CuArray{Float32, 4}) (repeats 2 times)
    @ Flux ~/.julia/packages/Flux/Lffio/src/layers/basic.jl:36
 [12] (::Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}}})(x::CUDA.CuArray{Float32, 4})
    @ Flux ~/.julia/packages/Flux/Lffio/src/layers/basic.jl:38
 [13] forward(nn::ResNet, state::CUDA.CuArray{Float32, 4})
    @ AlphaZero.FluxLib ~/code/jls/AlphaZero.jl/src/networks/flux.jl:142
 [14] forward_normalized(nn::ResNet, state::CUDA.CuArray{Float32, 4}, actions_mask::CUDA.CuArray{Float32, 2})
    @ AlphaZero.Network ~/code/jls/AlphaZero.jl/src/networks/network.jl:260
 [15] evaluate_batch(nn::ResNet, batch::Vector{NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}})
    @ AlphaZero.Network ~/code/jls/AlphaZero.jl/src/networks/network.jl:308
 [16] fill_and_evaluate(net::ResNet, batch::Vector{NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}; batch_size::Int64, fill::Bool)
    @ AlphaZero ~/code/jls/AlphaZero.jl/src/simulations.jl:32
 [17] (::AlphaZero.var"#34#35"{Bool, ResNet, Int64})(batch::Vector{NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}})
    @ AlphaZero ~/code/jls/AlphaZero.jl/src/simulations.jl:49
 [18] macro expansion
    @ ~/code/jls/AlphaZero.jl/src/batchifier.jl:62 [inlined]
 [19] macro expansion
    @ ~/code/jls/AlphaZero.jl/src/util.jl:57 [inlined]
 [20] (::AlphaZero.Batchifier.var"#1#3"{AlphaZero.var"#34#35"{Bool, ResNet, Int64}, Int64, Channel{Any}})()
    @ AlphaZero.Batchifier ./threadingconstructs.jl:169

jonathan-laurent (Owner) commented:

> Besides, I noticed that you just added NNlibCUDA to the deps while Flux stays the same. Is that OK?

If you look at the Manifest, I am now using the dg/cuda16 development branch of Flux, which should be merged pretty soon in a new patch release (see FluxML/Flux.jl#1571).

> Do I need to set ENV["JULIA_CUDA_MEMORY_POOL"] to some value?

With CUDA 3.0, you should set it to "cuda" or not set it at all ("cuda" is the default memory pool, provided that your CUDA toolkit version is at least 11.2). I agree that the comment in scripts/alphazero.jl is outdated here and I should update it.

> An error happens during self-play in iteration 3:

These kinds of errors are often an OOM in disguise... Given your config, this probably indicates a memory leak somewhere... It would be interesting to know whether you get the same error using the "cuda" memory pool.
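(If it helps, here is a minimal sketch of the kind of memory check I would run between iterations; CUDA.memory_status and CUDA.reclaim are standard CUDA.jl utilities, and the exact placement is just a suggestion.)

using CUDA

CUDA.memory_status()   # how much GPU memory is currently used vs. available
GC.gc(true)            # full Julia GC pass, freeing unreferenced CuArrays
CUDA.reclaim()         # return cached pool memory to the driver
CUDA.memory_status()   # compare before/after to spot allocations that never come back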

Thanks for your help in figuring this out. This is very helpful!


magicly commented Apr 23, 2021

Thanks. I'll test the "cuda" memory pool.


magicly commented Apr 24, 2021

LoadError: AssertionError: The CUDA memory pool is only supported on CUDA 11.2+

It seems we've found the reason.

My computer broke yesterday and it took me some time to recover it. I'll update the CUDA version and test again.

jonathan-laurent (Owner) commented:

Please note that CUDA.jl usually installs its own version of the CUDA toolkit. Therefore, if your CUDA driver is compatible with 11.2, you might not even need to update your system's CUDA installation.

You can use CUDA.versioninfo() to check what version of the CUDA toolkit CUDA.jl is currently using.


magicly commented Apr 25, 2021

Yeah, I remember that when I instantiated CUDA 3, Pkg downloaded a CUDA toolkit artifact. Something seems wrong:

julia> CUDA.versioninfo()
CUDA toolkit 11.0.3, artifact installation
CUDA driver 11.0.0
NVIDIA driver 450.102.4

Libraries:
- CUBLAS: 11.2.0
- CURAND: 10.2.1
- CUFFT: 10.2.1
- CUSOLVER: 10.6.0
- CUSPARSE: 11.1.1
- CUPTI: 13.0.0
- NVML: 11.0.0+450.102.4
- CUDNN: 8.10.0 (for CUDA 11.2.0)
- CUTENSOR: 1.2.2 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.6.0
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: GeForce RTX 2080 Ti (sm_75, 10.367 GiB / 10.759 GiB available)

I'll install it manually.


magicly commented Apr 25, 2021

I think it's because my NVIDIA driver was too old. After updating the driver to 460.73.01, everything is OK.

Interestingly, after upgrading the driver, I installed the CUDA toolkit from the NVIDIA website, which is 11.3. When running AlphaZero, CUDA.jl still downloads the CUDA toolkit 11.2 artifact.

Whether or not I set ENV["JULIA_CUDA_MEMORY_POOL"] = "cuda", self-play takes about 40m per iteration, almost the same as the previous split mode.


magicly commented Apr 25, 2021

I found another problem: my i7-8700 machine (3.20GHz, RTX 2080 SUPER, 32 GB memory) is about 20% faster in self-play than my i9-10940X machine (3.30GHz, RTX 2080 Ti, 64 GB memory): about 60 vs 50 samples per second, 50% vs 35% GPU utilization, and 500% vs 200% CPU utilization.

I tried to keep everything the same: Julia version, NVIDIA driver / CUDA toolkit version, and AlphaZero.jl code. It's very strange. Yeah, this may be another issue; I'll do more digging.

jonathan-laurent (Owner) commented:

> Whether or not I set ENV["JULIA_CUDA_MEMORY_POOL"] = "cuda", self-play takes about 40m per iteration, almost the same as the previous split mode.

On CUDA 3.0, "cuda" is the default memory pool and it will therefore be used whether you set ENV["JULIA_CUDA_MEMORY_POOL"] = "cuda" or not.

> Interestingly, after upgrading the driver, I installed the CUDA toolkit from the NVIDIA website, which is 11.3. When running AlphaZero, CUDA.jl still downloads the CUDA toolkit 11.2 artifact.

This is because 11.3 is recent and the CUDA.jl maintainers haven't tested it properly yet. I imagine this will be fixed in the next release though.

> My i7-8700 machine (3.20GHz, RTX 2080 SUPER, 32 GB memory) is about 20% faster in self-play than my i9-10940X machine (3.30GHz, RTX 2080 Ti, 64 GB memory): about 60 vs 50 samples per second, 50% vs 35% GPU utilization, and 500% vs 200% CPU utilization.
> I tried to keep everything the same: Julia version, NVIDIA driver / CUDA toolkit version, and AlphaZero.jl code. It's very strange.

This is surprising indeed, and something I have also observed in the past (running AlphaZero.jl on supposedly more powerful machines results in worse performance). One possible reason (which I am currently investigating) is that Julia currently does not allow tasks to migrate between threads, so random circumstances that influence which task gets assigned to which thread by the scheduler may result in unbalanced CPU loads.


magicly commented Apr 26, 2021

I suspect there may be too many threads (one thread per worker, 128 threads vs my 14-core CPU), leading to a lot of context switching. I think the ideal is to have either the CPU or the GPU nearly 100% utilized. I'm trying to use a thread pool to see if we can get there.

Besides, maybe there is still a CUDA memory leak. It just crashed after iteration 3, throwing a CUDA error. I posted an issue here: JuliaGPU/CUDA.jl#866

jonathan-laurent (Owner) commented:

Last time I checked, 128 workers (not 128 threads: Julia will spawn 128 tasks and spread them across as many threads as you have CPU cores available) were faster on my computer than 64. The goal is not so much to parallelize simulation as to send big batches to the neural network. One reason the GPU utilization is not currently higher (beyond time spent in GC) is that the inference server currently stops the world: it only runs when all of the 128 workers are stuck on an inference request. Therefore, every time it is done running, it must stay idle while it waits for all the workers to send data again. I am going to see if I can optimize this.
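(To illustrate the pattern described above, here is a simplified, purely illustrative sketch; it is not the actual AlphaZero.jl batchifier, and the Request type and evaluate argument are hypothetical.)

# Workers push requests onto a shared channel and block on a reply channel.
# The server only runs the network once an entire batch has accumulated, so the
# GPU sits idle while it waits for the slowest worker: the "stop-the-world" effect.
struct Request
    state::Any
    reply::Channel{Any}
end

function inference_server(requests::Channel{Request}, evaluate; batch_size::Int)
    while true
        batch = [take!(requests) for _ in 1:batch_size]   # blocks until the batch is full
        results = evaluate([r.state for r in batch])      # one big GPU call
        foreach((r, y) -> put!(r.reply, y), batch, results)
    end
end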

Ultimately, I agree with you that we should shoot for either ~100% GPU utilization or ~100% CPU utilization, and your criterion is a good one.

Also, it is very good that you are investigating possible memory leaks in CUDA.jl; I will be looking into this as well.


jonathan-laurent commented Apr 26, 2021

I just pushed a change that enables a major speedup on my computer (I went from ~40% GPU utilization during self-play to ~70%). Specifically, I now allow the number of simulation workers to be larger than the batch size of the inference server, so that the CPUs can keep simulating games while the GPU is running. You may want to try it out. Also, if your computer has a lot of RAM (>= 32 GB) and a powerful GPU, you may want to increase the number of simulation workers and the batch size:

params.self_play.sim.batch_size=128
params.self_play.sim.num_workers=256
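(For reference, a minimal sketch of how such overrides could be applied from a script, assuming the parameter structs are immutable and rebuilt with Setfield.jl's @set macro; both the helper below and the use of Setfield are assumptions, not the official API.)

using Setfield   # assumption: parameter structs are immutable, so rebuild them with @set

# Hypothetical helper: return a copy of an existing Params object with a larger
# self-play batch size and worker count (field path taken from the snippet above).
function scale_up(params; batch_size=128, num_workers=256)
    params = @set params.self_play.sim.batch_size = batch_size
    params = @set params.self_play.sim.num_workers = num_workers
    return params
end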


magicly commented Apr 27, 2021

Wow, you made a lot of commits. Thanks very much! I'll test it and post the results.


magicly commented Apr 29, 2021

I tested the default parameters:

params.self_play.sim.batch_size=64
params.self_play.sim.num_workers=128

It gives about a 40% speedup: 40 minutes vs 30 minutes per iteration, 50 vs 72 samples per second, and 40% vs 55% GPU utilization (so I think there is still room for another ~2x speedup). But doubling the parameters makes it slower, which is strange.

I'll do more experiments and try to figure it out. When I'm ready, I'll post another issue or make a PR.

magicly closed this as completed Apr 29, 2021