
CUB Memory Manager + cuDNN v4 and v5 support #3919

Closed
wants to merge 3 commits

Conversation

drnikolaev
Contributor

This PR adds two inseparable features to Caffe: a high-performance CUB memory manager and the long-awaited upgrade from cuDNN v3 to cuDNN v4 (and the upcoming v5).

@seanbell

Right now GitHub refuses to render the entire diff (too many changes), so the only thing one can see is the added 3rdparty files. A workaround could be to add all 3rdparty files in one commit, and put all other changes in a second commit.

@drnikolaev
Contributor Author

OK, thank you. Split into 2 commits.

@@ -316,6 +317,21 @@ class Layer {
param_propagate_down_[param_id] = value;
}

bool IsForwardPassed() const {
Contributor

Is it possible to not have these as a part of Layer? AFAICT you can have exactly the same effect by just putting these forward_passed/backward_passed as instance variables in CuDNNBatchNormalizationLayer, and you don't add (essentially) completely unused base functions to a core class.

IMO it's preferable to keep the core classes like Layer, Blob as small as possible.

Contributor

This is part of the annoyance of Caffe's lazy allocation. There are allocations that happen late, specifically the setup of cuRAND, so we need to be able to delay the final choice of algorithm until we know how much memory we really have. (There is an upcoming PR to move to the findEx paths in cuDNN v5 instead of the get paths, which makes this even more challenging.) We think we need this more generally. We'd love a better solution, but we start going down a "plan"-style path pretty quickly.

Contributor Author

Andrew, there are two reasons for keeping them in the base class:

  1. In Net::ForwardFromTo and Net::BackwardFromTo we call setters using pointers to base classes (i.e. pure virtuals would be even more expensive here).
  2. Most probably there will be more use cases like this in other cuDNN-based layer implementations.

Contributor

Unless I've misunderstood, it's hard to believe that there is any real performance implication due to 1.

If I'm imagining it correctly, what @thatguymike is calling a '"plan" style path' sounds right to me, though admittedly difficult to implement in current Caffe. [In fact, I have code of my own, (not as a Caffe branch) that takes that path, so it might be nice to discuss that elsewhere.]

Currently, however, this is a hack. Note that in current usage, memory usage can change (potentially drastically) after the initial forward-backward pass due to reshaping. So it's better not to push such things into the core of Caffe. If it becomes useful to share this kind of code among cuDNN layers, that can be done with another (intermediate) class, or some other kind of helper. This is also a (minor?) violation of modularity; layers are not really supposed to know anything about other layers or nets or whoever is running them.

I agree with @ajtulloch here -- I see no compelling reason to add these functions to all layers.

Contributor Author

drnikolaev commented May 4, 2016

@longjon Now please look at net.cpp:

for (int i = start; i <= end; ++i) {
  layers_[i]->ForwardPassed(true);
}

Here layers_[i] is a pointer to the base class Layer. How would we call ForwardPassed here?

Contributor

You can't, of course. Instead, you'd need to set forward_passed_ in layer code, presumably at the end of Forward_*.
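For illustration, a minimal sketch of that alternative, with the flag kept by the cuDNN layer itself rather than exposed on the Layer base class (the class name, member name, and include path below are illustrative, not this PR's actual code):

#include <vector>

#include "caffe/layers/batch_norm_layer.hpp"

namespace caffe {

// Hypothetical sketch: the cuDNN layer tracks forward completion itself, so
// Net never needs IsForwardPassed()/ForwardPassed() on the Layer base class.
template <typename Dtype>
class CuDNNBatchNormSketchLayer : public BatchNormLayer<Dtype> {
 public:
  explicit CuDNNBatchNormSketchLayer(const LayerParameter& param)
      : BatchNormLayer<Dtype>(param), forward_done_(false) {}

 protected:
  virtual void Forward_gpu(const std::vector<Blob<Dtype>*>& bottom,
                           const std::vector<Blob<Dtype>*>& top) {
    // ... cuDNN forward call goes here ...
    forward_done_ = true;  // set at the end of Forward_*, as suggested above
  }
  virtual void Backward_gpu(const std::vector<Blob<Dtype>*>& top,
                            const std::vector<bool>& propagate_down,
                            const std::vector<Blob<Dtype>*>& bottom) {
    // Any setup deferred until after the first forward pass can key off
    // forward_done_ here, with no callback from Net into the layer.
  }

  bool forward_done_;  // instance state, local to this layer type
};

}  // namespace caffe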

Contributor

(Perhaps the intention here was to keep track not simply of whether an individual layer's forward has completed, but whether all layers' forwards have completed. But note that that's not what this code does anyway.)

@cuihenggang

That's cool that we can do batch normalization using cuDNN. So when we use the CudnnBatchNorm layer, do we still need to append a Scale layer after it?

@borisgin

borisgin commented Apr 5, 2016

No, you don't have to add scale/shift layers. The cuDNN BN layer has both scale and shift inside.

@drnikolaev
Contributor Author

Removed unnecessary files from the 3rdparty directory. Just 5 of them are left; those are the only ones required.

@antran89
Contributor

antran89 commented Apr 7, 2016

Cool! How do I make changes to use this module? Do I need to add a new layer before each ReLU layer? Or will it automatically do BN when I specify a variable in the prototxt file?

@mfernezir

Regarding BN layer usage: I've recently posted in NVIDIA/DIGITS#629 about some differences between NVIDIA Caffe and the BVLC version.

Since this PR has the same BatchNormParameter message in caffe.proto as NVIDIA's current one, the following example should work in BVLC Caffe as well:

## BatchNorm
layer {
  bottom: "conv1/7x7_s2"
  name: "conv1/7x7_s2/bn"
  top: "conv1/7x7_s2/bn"
  type: "BatchNorm"
  param {
    lr_mult: 1
    decay_mult: 0
  }
  param {
    lr_mult: 1
    decay_mult: 0
  }
  batch_norm_param {
    scale_filler {
      type: "constant"
      value: 1
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  bottom: "conv1/7x7_s2/bn"
  top: "conv1/7x7_s2/bn"
  name: "conv1/relu_7x7"
  type: "ReLU"
}

There's no need to specify use_global_stats, since Caffe automatically infers the correct state, TEST or TRAIN. The same layer definition is used for both the train_val and deploy prototxt.

@drnikolaev
Contributor Author

Added "Returned shift and scale back to BN layer" commit to make BN layer implementation consistent with the paper, cuDNN and other frameworks.

@borisgin

Today, if you want to use the BN layer without scale and shift, you can initialize these two parameters with 1 and 0 and set lr and weight decay to 0 in the train_val.prototxt (see the sketch below). I can add a new parameter which does this automatically.
Boris
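
A minimal prototxt sketch of that workaround (layer and blob names are illustrative; the two param entries correspond to this PR's scale and bias blobs, in that order):

layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1/bn"
  param { lr_mult: 0 decay_mult: 0 }  # scale: frozen at its initial value 1
  param { lr_mult: 0 decay_mult: 0 }  # bias: frozen at its initial value 0
  batch_norm_param {
    scale_filler { type: "constant" value: 1 }
    bias_filler { type: "constant" value: 0 }
  }
}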


void BatchNormLayer<Dtype>::compute_sum_per_channel_gpu(int N, int C, int S,
const Dtype *x, Dtype *y ) {
// assume that x.shape(1)==C
Blob<Dtype> temp_NC;
Contributor

It looks like this was meant to be temp_NC_? Do you want to s/temp_NC/temp_NC_/g in these files?


Originally it was temp_NC_, but then I decided that it is safer to allocate temp_NC inside these 2 functions to avoid a potential hidden overwrite (for example, if I use temp_NC_ outside). I will re-check whether this has a noticeable effect on time.

@@ -22,17 +23,29 @@ void BatchNormLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
if (this->blobs_.size() > 0) {
LOG(INFO) << "Skipping parameter initialization";
} else {
-    this->blobs_.resize(3);
+    this->blobs_.resize(5);
@wlike commented Apr 21, 2016

Why 5 blobs? blobs_[4] is not used in the current code, and according to the usage of the BatchNorm layer in the prototxt file given, there should be 4 blobs.


blobs_[4] is referenced in a few places, but I agree that it looks like it's not used anywhere.

@borisgin can you comment on this?

@mathmanu

If this BN method is used, will I be able to load a caffemodel (for finetuning) trained using the old BN method? I tried this in the nvidia/caffe repo, and loading the caffemodel exited, complaining that the number of blobs doesn't match. This could be a critical need, as there are several older caffemodels out there that we need to use.

@borisgin

No. The old BN has 2 blobs: global mean and variance. The new one has 5: scale, bias, global_mean, global_variance, and global_counter.

@mathmanu

This could be a problem, since several popular models become unusable (e.g. https://github.com/KaimingHe/deep-residual-networks).

Do we really have to break backward compatibility this way? In the old BN we used to have a separate scaling layer, and that was fine.

Kindly reconsider this and try to keep compatibility.

@mathmanu

mathmanu commented Sep 25, 2016

Alternatively, you could provide an upgrade path as well: for example, while loading the older model for finetuning, the scale and bias blobs could be forced to one and zero respectively.

I also saw that the shapes of the blobs (global_mean and global_variance) were different, although they were of the same size; this also creates a problem.

@mathmanu

I agree that there is merit in what you did by combining the normalization and scaling.

Yet another easy fix is to make this a different layer, say "BatchNormScale", and keep the older layer as-is for backward compatibility. This is so far the simplest (minimal code changes) solution that I could come up with.

@borisgin

Agreed, adding a new layer would be the simplest way.

@mathmanu

So do you think you can do the name change (and keep the old layer) right away? I would love to use this new layer with CUDNN, as CUDNN gives me a 2x boost in speed.

@mathmanu

Do you have any convergence issues with the CUDNN BatchNorm used in this PR? In nvidia/caffe, I had to set the engine to CAFFE for BatchNorm to get convergence.

I was struggling with the convergence issue, but finally the following worked for me. Specifying the engine as CAFFE is important; CUDNN BatchNorm doesn't converge for me.

The following is the configuration that I used in the nvidia/caffe version. I am posting it here because I think the underlying implementation is the same.

layer {
name: "bn2"
bottom: "conv2"
top: "conv2"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CAFFE
}
}

@mathmanu

If the CUDNN BatchNorm converged, that would have given me an overall 4x boost in speed, but now, with the CAFFE engine for BatchNorm, I get only a 2x boost overall!

@mathmanu

I have a couple more comments:

  1. If you change the order of the blobs to global_mean, global_variance, scale, bias, global_counter, then I don't have to specify 4 param fields for lr_mult and decay_mult, but only 2.
  2. If the definition of the scale and bias fields in BatchNormParameter is changed to
    optional float scale_filler = 5 [default = 1];
    optional float bias_filler = 6 [default = 0];
    then I don't have to specify these in the prototxt either.

These changes will help someone who is trying to use this layer for the first time, apart from saving some space in the prototxt (see the sketch below).
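
Under the proposed blob order and filler defaults, the layer definition could shrink to something like the following (a sketch of the suggestion above only, not this PR's current behavior; names are illustrative):

layer {
  name: "bn2"
  type: "BatchNorm"
  bottom: "conv2"
  top: "conv2/bn"
  param { lr_mult: 0 decay_mult: 0 }  # global mean, first under the proposed order
  param { lr_mult: 0 decay_mult: 0 }  # global variance
  # scale and bias keep their default lr_mult/decay_mult and default fillers
}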

@mathmanu

I need urgent help: any suggestions to solve the issue of non-convergence with CUDNN BatchNorm? Are the CUDNN BatchNorm implementations the same in nvcaffe-0.15, nvcaffe-0.16, and this PR? (I tried nvcaffe-0.16.) Any suggestions would help.

@borisgin

I will re-check the convergence issue with cuDNN_BN.


@borisgin

The idea to change parameter order is very good. I will do this.


@mathmanu

Thanks. How about the fillers - can you provide default values for them too?

@mathmanu

When you check the convergence issue with cuDNN_BN, kindly do so with a network that has many BN layers, for example ResNet-18.

@borisgin

Can you send me your solver.prototxt and train_test.prototxt files for the model which converges with the Caffe engine and diverges with cuDNN, please?


@mathmanu

mathmanu commented Sep 27, 2016

I couldn't attach it for some reason, so I have copied the combined (solver + train) prototxt below. If you change the BatchNorm engine to CAFFE, it will start to converge.

#Solver parameters
test_iter: 200
test_interval: 1000
test_initialization: true
display: 100
base_lr: 0.01
lr_policy: "multistep"
stepvalue: 25000
stepvalue: 50000
stepvalue: 75000
stepvalue: 100000
gamma: 0.1
max_iter: 125000
momentum: 0.9
weight_decay: 1e-4
regularization_type: "L2" #"L1"
snapshot: 1000
snapshot_prefix: "training/resnet18"
solver_mode: GPU
random_seed: 33

#Net parameters
net_param {

name: "ResNet-18(1024)"

layer {
name: "data"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}

transform_param {
crop_size: 224
mean_value: 128
mean_value: 128
mean_value: 128
mirror: true
}
data_param {
source: "/user/me/files/data/datasets/object-detect/other/ilsvrc/2012/ilsvrc12_train_lmdb"
batch_size: 64
backend: LMDB
}
}

layer {
name: "data"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST
}

transform_param {
crop_size: 224
mean_value: 128
mean_value: 128
mean_value: 128
mirror: false
}
data_param {
source: "/user/me/files/data/datasets/object-detect/other/ilsvrc/2012/ilsvrc12_val_lmdb"
batch_size: 64
backend: LMDB
}
}
layer {
name: "conv1"
bottom: "data"
top: "conv1"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
param { lr_mult: 2 decay_mult: 0 }
convolution_param {
num_output: 64
kernel_size: 7
pad: 3
stride: 2
weight_filler { type: "msra" std: 0.010 }
bias_filler { type: "constant" value: 0 }
}
}
layer {
name: "bn_conv1"
bottom: "conv1"
top: "conv1"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "conv1_relu"
bottom: "conv1"
top: "conv1"
type: "ReLU"
}
layer {
name: "pool1"
bottom: "conv1"
top: "pool1"
type: "Pooling"
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layer {
name: "res2a_branch2a"
bottom: "pool1"
top: "res2a_branch2a"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 64
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn2a_branch2a"
bottom: "res2a_branch2a"
top: "res2a_branch2a"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res2a_branch2a_relu"
bottom: "res2a_branch2a"
top: "res2a_branch2a"
type: "ReLU"
}
layer {
name: "res2a_branch2b"
bottom: "res2a_branch2a"
top: "res2a_branch2b"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 64
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn2a_branch2b"
bottom: "res2a_branch2b"
top: "res2a_branch2b"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res2a"
bottom: "pool1"
bottom: "res2a_branch2b"
top: "res2a"
type: "Eltwise"
}
layer {
name: "res2a_relu"
bottom: "res2a"
top: "res2a"
type: "ReLU"
}
layer {
name: "res2b_branch2a"
bottom: "res2a"
top: "res2b_branch2a"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 64
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn2b_branch2a"
bottom: "res2b_branch2a"
top: "res2b_branch2a"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res2b_branch2a_relu"
bottom: "res2b_branch2a"
top: "res2b_branch2a"
type: "ReLU"
}
layer {
name: "res2b_branch2b"
bottom: "res2b_branch2a"
top: "res2b_branch2b"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 64
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn2b_branch2b"
bottom: "res2b_branch2b"
top: "res2b_branch2b"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res2b"
bottom: "res2a"
bottom: "res2b_branch2b"
top: "res2b"
type: "Eltwise"
}
layer {
name: "res2b_relu"
bottom: "res2b"
top: "res2b"
type: "ReLU"
}
layer {
name: "res3a_branch2a"
bottom: "res2b"
top: "res3a_branch2a"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 128
kernel_size: 3
pad: 1
stride: 2
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn3a_branch2a"
bottom: "res3a_branch2a"
top: "res3a_branch2a"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res3a_branch2a_relu"
bottom: "res3a_branch2a"
top: "res3a_branch2a"
type: "ReLU"
}
layer {
name: "res3a_branch2b"
bottom: "res3a_branch2a"
top: "res3a_branch2b"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 128
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn3a_branch2b"
bottom: "res3a_branch2b"
top: "res3a_branch2b"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res3a_branch1"
bottom: "res2b"
top: "res3a_branch1"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 128
kernel_size: 1
pad: 0
stride: 2
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn3a_branch1"
bottom: "res3a_branch1"
top: "res3a_branch1"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res3a"
bottom: "res3a_branch1"
bottom: "res3a_branch2b"
top: "res3a"
type: "Eltwise"
}
layer {
name: "res3a_relu"
bottom: "res3a"
top: "res3a"
type: "ReLU"
}
layer {
name: "res3b_branch2a"
bottom: "res3a"
top: "res3b_branch2a"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 128
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn3b_branch2a"
bottom: "res3b_branch2a"
top: "res3b_branch2a"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res3b_branch2a_relu"
bottom: "res3b_branch2a"
top: "res3b_branch2a"
type: "ReLU"
}
layer {
name: "res3b_branch2b"
bottom: "res3b_branch2a"
top: "res3b_branch2b"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 128
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn3b_branch2b"
bottom: "res3b_branch2b"
top: "res3b_branch2b"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res3b"
bottom: "res3a"
bottom: "res3b_branch2b"
top: "res3b"
type: "Eltwise"
}
layer {
name: "res3b_relu"
bottom: "res3b"
top: "res3b"
type: "ReLU"
}
layer {
name: "res4a_branch2a"
bottom: "res3b"
top: "res4a_branch2a"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 256
kernel_size: 3
pad: 1
stride: 2
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn4a_branch2a"
bottom: "res4a_branch2a"
top: "res4a_branch2a"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res4a_branch2a_relu"
bottom: "res4a_branch2a"
top: "res4a_branch2a"
type: "ReLU"
}
layer {
name: "res4a_branch2b"
bottom: "res4a_branch2a"
top: "res4a_branch2b"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 256
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn4a_branch2b"
bottom: "res4a_branch2b"
top: "res4a_branch2b"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res4a_branch1"
bottom: "res3b"
top: "res4a_branch1"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 256
kernel_size: 1
pad: 0
stride: 2
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn4a_branch1"
bottom: "res4a_branch1"
top: "res4a_branch1"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res4a"
bottom: "res4a_branch1"
bottom: "res4a_branch2b"
top: "res4a"
type: "Eltwise"
}
layer {
name: "res4a_relu"
bottom: "res4a"
top: "res4a"
type: "ReLU"
}
layer {
name: "res4b_branch2a"
bottom: "res4a"
top: "res4b_branch2a"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 256
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn4b_branch2a"
bottom: "res4b_branch2a"
top: "res4b_branch2a"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res4b_branch2a_relu"
bottom: "res4b_branch2a"
top: "res4b_branch2a"
type: "ReLU"
}
layer {
name: "res4b_branch2b"
bottom: "res4b_branch2a"
top: "res4b_branch2b"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 256
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn4b_branch2b"
bottom: "res4b_branch2b"
top: "res4b_branch2b"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res4b"
bottom: "res4a"
bottom: "res4b_branch2b"
top: "res4b"
type: "Eltwise"
}
layer {
name: "res4b_relu"
bottom: "res4b"
top: "res4b"
type: "ReLU"
}
layer {
name: "res5a_branch2a"
bottom: "res4b"
top: "res5a_branch2a"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 512
kernel_size: 3
pad: 1
stride: 2
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn5a_branch2a"
bottom: "res5a_branch2a"
top: "res5a_branch2a"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res5a_branch2a_relu"
bottom: "res5a_branch2a"
top: "res5a_branch2a"
type: "ReLU"
}
layer {
name: "res5a_branch2b"
bottom: "res5a_branch2a"
top: "res5a_branch2b"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 512
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn5a_branch2b"
bottom: "res5a_branch2b"
top: "res5a_branch2b"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res5a_branch1"
bottom: "res4b"
top: "res5a_branch1"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 512
kernel_size: 1
pad: 0
stride: 2
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn5a_branch1"
bottom: "res5a_branch1"
top: "res5a_branch1"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res5a"
bottom: "res5a_branch1"
bottom: "res5a_branch2b"
top: "res5a"
type: "Eltwise"
}
layer {
name: "res5a_relu"
bottom: "res5a"
top: "res5a"
type: "ReLU"
}
layer {
name: "res5b_branch2a"
bottom: "res5a"
top: "res5b_branch2a"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 512
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn5b_branch2a"
bottom: "res5b_branch2a"
top: "res5b_branch2a"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res5b_branch2a_relu"
bottom: "res5b_branch2a"
top: "res5b_branch2a"
type: "ReLU"
}
layer {
name: "res5b_branch2b"
bottom: "res5b_branch2a"
top: "res5b_branch2b"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
convolution_param {
num_output: 512
kernel_size: 3
pad: 1
stride: 1
bias_term: false
weight_filler { type: "msra" std: 0.010 }
dilation: 1
group: 1
}
}
layer {
name: "bn5b_branch2b"
bottom: "res5b_branch2b"
top: "res5b_branch2b"
type: "BatchNorm"
param { #scale
lr_mult: 1
decay_mult: 1
}
param { #shift/bias
lr_mult: 1
decay_mult: 1
}
param { #global mean
lr_mult: 0
decay_mult: 0
}
param { #global var
lr_mult: 0
decay_mult: 0
}
batch_norm_param {
scale_filler {
type: "constant"
value: 1
}
bias_filler {
type: "constant"
value: 0
}
engine: CUDNN
}
}
layer {
name: "res5b"
bottom: "res5a"
bottom: "res5b_branch2b"
top: "res5b"
type: "Eltwise"
}
layer {
name: "res5b_relu"
bottom: "res5b"
top: "res5b"
type: "ReLU"
}
layer {
name: "pool5"
bottom: "res5b"
top: "pool5"
type: "Pooling"
pooling_param {
pool: AVE
kernel_size: 7
stride: 1
}
}
layer {
name: "conv6"
bottom: "pool5"
top: "conv6"
type: "Convolution"
param { lr_mult: 1 decay_mult: 1 }
param { lr_mult: 2 decay_mult: 0 }
convolution_param {
num_output: 1024
kernel_size: 1
pad: 0
stride: 1
weight_filler { type: "msra" std: 0.010 }
bias_filler { type: "constant" value: 0 }
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "conv6"
bottom: "label"
propagate_down: 1
propagate_down: 0
top: "loss"
loss_weight: 1
}

layer {
name: "accuracy"
type: "Accuracy"
bottom: "conv6"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "accuracy/top-5"
type: "Accuracy"
bottom: "conv6"
bottom: "label"
top: "accuracy/top-5"
include {
phase: TEST
}
accuracy_param {
top_k: 5
}
}

}

@mathmanu

Were you able to reproduce the behavior?

@borisgin

I was able to reproduce the problem: the Caffe engine is converging, but cuDNN BN diverges.
Btw, I would add two parameters to the BN layer definition:
moving_average_fraction: 0.9
eps: 0.0001


@mathmanu

Thank you so much for confirming. I'll wait for further information from you.

@mathmanu

Btw, I can see that the additional parameters you mentioned have default values, so technically I don't need to specify them. Do the values you suggested produce better accuracy?

// How much does the moving average decay each iteration?
optional float moving_average_fraction = 2 [default = .999];
// Small value to add to the variance estimate so that we don't divide by
// zero.
optional float eps = 3 [default = 1e-5];
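
For reference, either default can be overridden per layer inside batch_norm_param; a minimal sketch using the values suggested above (layer and blob names are illustrative):

layer {
  name: "bn_example"
  type: "BatchNorm"
  bottom: "conv_in"
  top: "conv_in/bn"
  batch_norm_param {
    moving_average_fraction: 0.9  # overrides the 0.999 default
    eps: 0.0001                   # overrides the 1e-5 default
  }
}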

@mathmanu

Is there a quick fix or a workaround that can solve this issue?
Thanks,

@borisgin

Yes. CUDNN_BN requires different top and bottom blobs, since it needs the bottom for the backward pass. I attached the fixed train_val.prototxt (I also made a few additional minor changes: changed the last layer from a 1x1 conv to an InnerProduct layer with num_output=1000).

train_val.txt
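
For readers without the attachment, the key change is to give each BatchNorm layer a top distinct from its bottom; a minimal sketch (layer and blob names are illustrative, and downstream layers must then consume the new top):

layer {
  name: "bn_conv1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1/bn"  # distinct from the bottom, so the input is kept for backward
  batch_norm_param { engine: CUDNN }
}
layer {
  name: "conv1_relu"
  type: "ReLU"
  bottom: "conv1/bn"  # reads the BN output instead of "conv1"
  top: "conv1/bn"
}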

@mathmanu

mathmanu commented Sep 30, 2016

Thanks. This helped a lot.

Btw, I have an observation. I have a network trained with the CAFFE BN engine. When I tried to TEST it using the CUDNN BN engine, Caffe exited saying that the shapes of the BN blobs mismatch. But since the blobs are the same size (just different shapes), I was able to forcefully reshape the blobs and run the TEST, and it gave correct results!

@borisgin

borisgin commented Oct 2, 2016

My bug :(


@mathmanu

mathmanu commented Oct 3, 2016

Don't worry, CUDNN is great: it gives me a 4x speed boost, and that makes a huge difference.

You just need to make slight changes and do some testing: a cross-compatibility test with the CAFFE engine and a backward-compatibility test.

@mathmanu

See another thread with a similar issue being reported: NVIDIA/DIGITS#629

@achaiah

achaiah commented Dec 1, 2016

@borisgin Out of curiosity, have you tried networks larger than ResNet-18 on NVIDIA's Caffe? Your ResNet-18 is the only one that converges for me, and I can't find a single example of a larger ResNet that does.
