Re-evaluating device info when getting closer to zero mem #153

Merged
drnikolaev merged 4 commits into NVIDIA:caffe-0.15 on Jun 1, 2016

Conversation

@drnikolaev commented May 29, 2016

When GPU memory is nearly exhausted, GPUMemoryManager::GetInfo may report the available memory as some value like 174MB, and cudnnGetConvolutionForwardAlgorithm decides to take about 143MB of it for a better algorithm. Then cudaMalloc refuses to allocate those 143MB out of the "last 174" and Caffe crashes. It has nothing to do with the cache: I verified that not a single deallocation happens before the OOM appears, so the cache is empty. After debugging I found that the device info values we maintain diverge from reality as more and more allocations are made, perhaps due to cudaMalloc alignment or for some other reason. The fix updates the dev_info_ structure when we get an allocation failure while calling the new allocator routines try_allocate (as well as try_reserve). We use this new "try" mechanism to ensure that the convolution algorithm chooser does the right thing.
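For illustration, here is a minimal sketch of the "try" pattern described above. The names (DevInfo, update_dev_info, try_allocate) and the structure are assumptions for this example, not necessarily the actual GPUMemoryManager code in the PR:

```cpp
// Hypothetical sketch: attempt an allocation and, on failure, re-read the
// real free memory from the driver instead of trusting the cached value.
#include <cuda_runtime.h>
#include <cstddef>

struct DevInfo {
  size_t free_mem;
  size_t total_mem;
};

static DevInfo dev_info_;  // cached device info; can drift from reality

// Refresh the cached device info directly from the CUDA runtime.
static void update_dev_info() {
  cudaMemGetInfo(&dev_info_.free_mem, &dev_info_.total_mem);
}

// Try to allocate `size` bytes. On failure: clear the sticky error state,
// re-evaluate the device info so callers (e.g. the convolution algorithm
// chooser) see accurate numbers, and report failure instead of crashing.
bool try_allocate(void** ptr, size_t size) {
  cudaError_t status = cudaMalloc(ptr, size);
  if (status != cudaSuccess) {
    cudaGetLastError();   // reset the error so later CUDA calls succeed
    update_dev_info();    // dev_info_ now reflects actual free memory
    *ptr = nullptr;
    return false;
  }
  return true;
}
```

A caller such as the algorithm chooser can then fall back to a less memory-hungry algorithm when try_allocate returns false, rather than letting cudaMalloc fail fatally near OOM.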

@drnikolaev merged commit d988833 into NVIDIA:caffe-0.15 on Jun 1, 2016
@drnikolaev (Author) commented:

Thanks @lukeyeager for reviewing the code. This PR will be followed by another one implementing a better memory distribution algorithm for the cuDNN Convolution Layer (as per our discussion).

@drnikolaev added the bug label Jun 3, 2016
@drnikolaev deleted the caffe-0.15-oom branch June 7, 2016