sharedRmsProp and async nature params #15

Merged: 1 commit into Kaixhin:async on May 7, 2016

Conversation

@lake4790k (Collaborator) commented May 7, 2016

shared rmsprop

The original DQN code has RMSProp based on g and gSquared (squared momentum). The paper uses shared RMSProp optimization, so I wanted to make the existing code shared, but I noticed that calculating with both g and gSq cannot be made "thread safe" in the async Hogwild! setting: gradient updates become NaN because of concurrent updates to g and gSq (and this can't be fixed without locking). Then I realised the paper just uses g and no squared momentum, and done that way it does indeed work with async Hogwild!. Shared RMSProp now looks similar to the torch rmsprop, which by the way seems buggy as it adds epsilon outside the sqrt.
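
For reference, the shared update now looks roughly like the sketch below (a minimal illustration with made-up names, not the exact code in this PR; sharedG stands for the single squared-gradient running average shared across all threads):

```lua
-- Minimal sketch of the shared RMSProp update described above (illustrative
-- names, not the exact code in this PR). sharedG is the one running average
-- of squared gradients, shared by all threads Hogwild!-style (no locking,
-- occasional lost updates are tolerated).
local function sharedRmsProp(params, gradParams, sharedG, lr, alpha, epsilon)
  -- Update the shared squared-gradient accumulator: G = alpha*G + (1-alpha)*g^2
  sharedG:mul(alpha):addcmul(1 - alpha, gradParams, gradParams)
  -- Parameter update, with epsilon inside the sqrt:
  -- theta = theta - lr * g / sqrt(G + epsilon)
  params:addcdiv(-lr, gradParams, torch.sqrt(sharedG + epsilon))
end
```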

speed

This seems like a huge issue. The figures in the paper suggest their implementation runs at ~700k frames per hour per thread, while mine runs at 125k per hour per thread. I was also running the original per-thread RMSProp and saw it converging, but much more slowly than in the paper, which would be explained by the processing speed difference.
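
(For scale, 700k frames per hour is roughly 700000 / 3600 ≈ 195 frames per second per thread, and 125k per hour is roughly 35 per second.)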

This 5.5x speed difference is huge; even if their implementation is in C++, Lua isn't supposed to be that much slower. This must be improved, because as things stand this async implementation is slower than running ER on the GPU. I'll try speeding up the code.

ALE

Another issue is ALE thread safety. The ALE author says:

ALE is close to, but possibly not quite thread-safe.

so I'm using a patched version of xitari where I modified the static fields of one class to be instance fields; this was the most obvious issue. I'll look into whether there's some other global state. The paper doesn't mention this, but they could have made more fixes. It's good that at least Bellemare says "close to"...

@Kaixhin merged commit 4d2559c into Kaixhin:async on May 7, 2016
@Kaixhin (Owner) commented May 7, 2016

shared rmsprop

Yep, makes sense. Interesting spot with the optim implementation - I made a note in the currently open issue torch/optim#99.

speed

As this code is focused on integrating these different methods, a bare-bones implementation of the model and training will always be faster. Since the flexibility mostly comes down to if statements, I'm hoping that the difference won't be too bad. Frequent logging also reduces speed; I'm not sure if that's too much of an issue, but it might be amplified by having several threads doing logging (to stdout and a file).

One thing we cannot solve is hardware: they use 16-core CPUs, i.e. one physical (not logical) core per thread. The fact that they don't show 32 threads makes me suspect that the number of physical cores should be equal to the number of threads.

ALE

I think that xitari is using ALE 0.4.4. 0.5.0 is a major revision and the latest version is 0.5.1, which are presumably a bit faster. The newer versions also have a direct greyscale output, which, if we had access to it, could probably speed up all the code significantly.

@lake4790k (Collaborator, Author) commented May 7, 2016

speed

I just arrived at the same conclusion: they must be using 16 physical cores... This is so computationally intensive that running on 2x the number of physical cores (i.e. on all hyperthreaded logical cores) almost halves the per-thread speed in our Atari, so my 11 cores are really just 5.5...

I computed ~200 frames/sec per thread based on the figures and comments in their paper, but that may be an overestimate (I'm not sure exactly what they mean by "frames per sec"): according to this, DeepMind said they achieved 70/sec. On my i7 I now measure 80 steps per sec (which could make sense, as many-core Xeons are much slower than i7s), but then I only have 5.5 / 16 ≈ 1/3 of their number of cores. This would then explain why they are that much faster.
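
(Putting those rough numbers together, and treating the per-thread rates as comparable: 16 cores × 70 steps/sec ≈ 1120 steps/sec for them, versus 5.5 effective cores × 80 steps/sec ≈ 440 steps/sec here, so a factor of roughly 2.5x from hardware alone. Ballpark only.)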

I also measured how much time the different parts of the training loop take. One suspicion I had was that the CircularBuffer, with its table of tensors > tensor cat, could be slow, but it actually doesn't take much time.

One optimisation I can make: if the eGreedy decision is greedy, the loop currently does a forward pass twice on the same state. Instead, the Qs can be reused when computing the next Y, so one step becomes just 1 forward + 1 backward instead of 2 forwards (in the case of a greedy action) + 1 backward; see the sketch below. This adds some speed in the later stages when epsilon becomes lower, but not too much (1 forward is a smaller part of the whole time).
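
To illustrate the idea (a hedged sketch with hypothetical names, not the actual AsyncAgent code):

```lua
-- Sketch of reusing the greedy forward pass (hypothetical names, not the
-- actual AsyncAgent code): when the epsilon-greedy choice is greedy, cache
-- the Q-values computed for action selection and reuse them for the target,
-- saving one forward pass per greedy step.
local cachedQs = nil

local function selectAction(net, state, epsilon, numActions)
  if torch.uniform() < epsilon then
    cachedQs = nil                       -- exploratory action: nothing to reuse
    return torch.random(numActions)
  end
  cachedQs = net:forward(state):clone()  -- greedy action: remember Q(s, .)
  local _, aIdx = cachedQs:max(1)
  return aIdx[1]
end

local function qValues(net, state)
  -- Reuse the cached forward pass when available, otherwise recompute
  return cachedQs or net:forward(state)
end
```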

I measured the pure speed of a bare-bones forward + backward in a script, then commented out everything in the AsyncAgent except the forward/backward, and it ran at the bare-bones speed as expected. But everything I commented out is absolutely needed, so I don't see any additional big gains to be made.

But now, with the number of cores in perspective, it actually looks quite promising. To be practical one may need a huge Xeon instead of an i7, so in practice it may not be faster than ER/GPU, but that's not the code's fault.

It's also nice that multithreading works efficiently with Torch; other Python projects seem to complain about speed because of the GIL, as I predicted...

Yes, you are right that supporting all the different methods can have an overhead (e.g. having the headConcat even when not using bootstrapping adds some overhead, as I measured), but this is also what makes your project uniquely valuable, allowing all of them to be combined.

ALE

Yes, greyscale to begin with would help for sure!

@Kaixhin (Owner) commented May 7, 2016

Thanks for having a look at where the code can be optimised - good to know (at least for me) that it's pretty tight. One thing that might be worth checking with your benchmarking code is whether cuDNN R5 in deterministic mode (see this old code for details) is now faster than Torch's standard CUDA code. Not relevant for async methods, but could be a nice speedup for everything else. Eventually Torch will get half-precision support, which could also help if these models don't require too much precision.

I barely know C/C++, so unfortunately making a wrapper for the new ALE is beyond me.

@lake4790k (Collaborator, Author) commented May 8, 2016

re: benchmarking ER/GPU: I'm planning to improve my Timer class so that one doesn't need to add explicit calls to it, just register classes/functions to measure. I'll add it to Atari soon, and with that I'll have a look at the ER/GPU case!

hyperthreaded async: I think it will still make sense to run async on all logical cores (especially on lower core count i7s). The overall frames/sec will not increase much (e.g. 40 steps/sec on hyperthreaded cores vs 70 steps/sec on non-hyperthreaded), but the diversity of experiences learnt will increase, similar to random sampling from ER memory, as the paper also points out.

multi-CPU async: I think it would be really interesting to see whether Hogwild! async scales across multiple CPUs. It's a question of how the CPU caches get synchronized between the CPUs. E.g. it would be awesome to get an almost 2x speed gain on a dual-Xeon machine compared to a single Xeon.

Although DeepMind didn't mention this, I'm somewhat optimistic: this paper from Yahoo discusses Hogwild! SGD and says:

For single CPU socket machines, no loss was measured. For two-sockets machines, where the latency between cores can be higher, the loss stayed below 0.5% for most network sizes.

and also DeepMind's Gorila is distributed even across machines, so I would expect this to work, resulting in really state-of-the-art performance. Maybe scaling beyond 16 threads doesn't increase the diversity effect, but it would still scale the raw speed well.
