Re-evaluating device info when get closer to zero mem #153

drnikolaev · 2016-05-29T07:48:35Z

When GPU memory is nearly exhausted, GPUMemoryManager::GetInfo may report some amount of available memory, say 174MB, and cudnnGetConvolutionForwardAlgorithm decides to take about 143MB of it for a better algorithm. cudaMalloc then refuses to allocate 143MB out of the "last 174" and Caffe crashes. This has nothing to do with the cache: I verified that not a single deallocation happens before the OOM appears, so the cache is empty. After debugging I found that the device info values we maintain diverge from reality as we make more and more allocations, perhaps due to cudaMalloc alignment or for some other reason. The fix updates the dev_info_ structure when an allocation fails in the new allocator call try_allocate (as well as try_reserve). We use the new try mechanism to ensure that the convolution algorithm chooser does the right thing.
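To illustrate the idea, here is a minimal, self-contained sketch of "refresh the tracked free-memory value when an allocation fails". It is not the code from this PR: `FakeDevice`, `MemoryManager`, `malloc_bytes`, `query_free` and the 2MB allocation granularity are all hypothetical stand-ins chosen for illustration. In real code, `cudaMalloc` and `cudaMemGetInfo` would presumably play the roles of `malloc_bytes` and `query_free`.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kMB = 1024 * 1024;

// Hypothetical stand-in for the GPU. It rounds every request up to a fixed
// granularity (an assumption made for this sketch), so real free memory
// shrinks faster than the sum of the requested sizes.
struct FakeDevice {
  std::size_t free_bytes;
  std::size_t granularity;

  // Plays the role of cudaMalloc: returns false when the request does not fit.
  bool malloc_bytes(std::size_t size) {
    assert(granularity > 0);
    std::size_t rounded = (size + granularity - 1) / granularity * granularity;
    if (rounded > free_bytes) return false;
    free_bytes -= rounded;
    return true;
  }

  // Plays the role of cudaMemGetInfo: reports the real free memory.
  std::size_t query_free() const { return free_bytes; }
};

// Hypothetical manager that keeps its own estimate of free memory.
class MemoryManager {
 public:
  explicit MemoryManager(FakeDevice* dev)
      : dev_(dev), tracked_free_(dev->query_free()) {}

  std::size_t tracked_free() const { return tracked_free_; }

  // Returns false instead of crashing when the device refuses the request,
  // and re-reads the device's free memory so later decisions use real numbers.
  bool try_allocate(std::size_t size) {
    if (dev_->malloc_bytes(size)) {
      // Naive bookkeeping: subtract the requested size, not the rounded one.
      tracked_free_ = tracked_free_ > size ? tracked_free_ - size : 0;
      return true;
    }
    tracked_free_ = dev_->query_free();  // the estimate drifted, so refresh it
    return false;
  }

 private:
  FakeDevice* dev_;
  std::size_t tracked_free_;
};
```

With a 512MB device and 200 allocations of 1MB each, the tracked value in this sketch says 312MB are free while only 112MB really are. A 143MB request then fails, the tracked value snaps back to 112MB, and a caller such as an algorithm chooser can fall back to a smaller workspace instead of hitting an OOM.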

drnikolaev · 2016-06-01T06:58:42Z

Thanks @lukeyeager for reviewing the code. This PR will be followed by another one implementing a better memory distribution algorithm for the cuDNN convolution layer (as per our discussion).

drnikolaev added 4 commits May 29, 2016 00:10

Re-evaluating device info when get closer to zero mem

9173569

Re-evaluating device info when get closer to zero mem

9141deb

Merge branch 'caffe-0.15-oom' of github.com:drnikolaev/caffe into caf…

c4efe48

…fe-0.15-oom Conflicts: src/caffe/util/gpu_memory.cpp

Lint error fix

1d4882b

drnikolaev merged commit d988833 into NVIDIA:caffe-0.15 Jun 1, 2016

drnikolaev added the bug label Jun 3, 2016

drnikolaev deleted the caffe-0.15-oom branch June 7, 2016 06:47
