Nd conv pool #2824

Status: Open · wants to merge 14 commits into base: master
Conversation

@WGW101 commented Jul 27, 2015

Hi!

Following my issue ticket #2671, here is my pull request for nD convolution and pooling using cuDNN primitives.

The nD convolution by itself seems to work, but the bias addition using cudnnAddTensor() returns a NOT_SUPPORTED status.

The nD pooling doesn't work and returns NOT_SUPPORTED.
Apparently the nD pooling descriptor might only be a placeholder in this version, so this might work with a future version of cuDNN...

I inherit my layers directly from the Layer class rather than from BaseConvolutionLayer and BasePoolingLayer, to avoid modifying any existing (and working..) features.
The major drawback of this approach is that these layers can't fall back on other engines if cuDNN is not supported by the user's configuration. But as I declared a LayerFactory entry for NdConvolution and NdPooling, it should be relatively easy to fix this behaviour.

Don't hesitate to give me feedback on these two new layers,
and to share any new insight about why it doesn't work.

I'm already aware of PR #2049 for nD convolution, but I'm still missing nD pooling (actually I only need 3D pooling in my application).

Cheers,

@Yeongtae commented Aug 3, 2015

I checked out this branch, but I couldn't run it because of the "num_axes() <= 4" check failing in blob.hpp and base_conv_layer.cpp.

@WGW101 (Author) commented Aug 3, 2015

The two new layers I added don't use the base_conv or base_pool classes.
To use them, change the layer type from "Convolution" to "NdConvolution" (see layer_factory.cpp), then describe your kernel, stride and pad shapes using BlobShape messages (see caffe.proto).

Can I see your prototxt file?

@Yeongtae commented Aug 3, 2015

@WGW101 Thank you for the response. I changed Convolution to NdConvolution, but it shows a "not implemented yet" error. I didn't build with cuDNN, and that produced this error. Is that right?

@WGW101 (Author) commented Aug 3, 2015

Yes, unfortunately it only works with cuDNN for now.
(Actually not everything works even with it, but I'm waiting for v3, which should come out very soon.)

For an implementation of nD convolution with the Caffe engine, see PR #2049 by Jeff Donahue.

@Yeongtae commented Aug 3, 2015

@WGW101 I'm using #2049 and #2442. I think the second one is better.

In addition, I'm working on 3D convolution for action classification from video, to extract spatial and temporal features. I'm very confused about how to handle the network's blobs, e.g. the weights of conv, pool and ip layers, because nD data can't be used with matcaffe. Do you have any idea for this?

@WGW101 (Author) commented Aug 3, 2015

@Yeongtae The Python interface is quite easy to understand and very similar to what the Matlab one looks like.

Let's say you load your network like this:
net = caffe.Net('path/to/your/prototxt', caffe.TEST)

Then the weights are available like this:
net.params['LayerName'][0].data

and the biases like this:
net.params['LayerName'][1].data

That works for Conv and IP layers, but not for pooling, as it has no parameters..
There are a few Python notebook tutorials in the base caffe repo; take a look at them for more info.
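
Putting that together, a minimal pycaffe sketch; the prototxt path and the layer name "conv1" are hypothetical placeholders:

import caffe

# Loading only the prototxt initializes the weights with their fillers;
# add a .caffemodel argument to inspect trained weights instead.
net = caffe.Net('path/to/your/deploy.prototxt', caffe.TEST)

weights = net.params['conv1'][0].data  # for 3D conv: (num_output, channels, kd, kh, kw)
biases = net.params['conv1'][1].data   # shape: (num_output,)
print weights.shape, biases.shape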

@Yeongtae commented Aug 5, 2015

@WGW101 Following your advice, I have solved my problem.

I'm now verifying that the 3D convolution is right, using the convn function in Matlab.
I added the bias of conv1 to the result of the convn function.
To do this, I extracted the input data, the result of conv1, and the weights of conv1 using pycaffe.

After testing, I see a weird result.

With ones(n,n,n) as input, the difference between the Caffe result and the Matlab result shows that all elements are equal to 1.0e-06 * -0.4992.
With rand(n,n,n) as input, the difference between the Caffe result and the Matlab result shows that all elements are different values.

Therefore, it seems that the 3D convolution of this branch and Matlab's convn are different.

Do you have any idea about this?
And do you think the nD conv and nD pooling are correctly implemented?

@Yeongtae commented Aug 5, 2015

Using imfilter on the region without padding, it shows only a very small error of 1.0e-06 * n.
I think I have finished verifying this branch.
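
A minimal numpy/scipy sketch of why convn disagreed while imfilter matched: Caffe and cuDNN compute cross-correlation (no kernel flip), whereas Matlab's convn flips the kernel; the array sizes below are hypothetical.

import numpy as np
from scipy.ndimage import correlate, convolve

x = np.random.rand(5, 5, 5)  # hypothetical input volume
k = np.random.rand(3, 3, 3)  # hypothetical 3D kernel

corr = correlate(x, k, mode='constant')  # what Caffe/cuDNN (and imfilter) compute
conv = convolve(x, k, mode='constant')   # what convn computes (kernel flipped)

# The two agree only once the kernel is flipped along every axis:
print np.allclose(corr, convolve(x, k[::-1, ::-1, ::-1], mode='constant'))  # True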

@dhkim19e commented:
Hi!

I noticed that cuDNN was updated this week (in the v3 RC, cudnnAddTensor was not supported).

So I checked with the new release, and this PR works fine after just changing the function cudnnAddTensor to cudnnAddTensor_v3 (in the new API, the second parameter, 'mode', was removed).

Thanks!

@WGW101 (Author) commented Sep 14, 2015

@squall815

Hi!

Thanks for your feedback!

I'm sorry I wasn't able to test this PR myself with the new version of cuDNN, as my hardware isn't supported by CUDA 7.0 (required for cuDNN v3...)

I hope I'll be able to resume development of this branch some day (cleaning everything up to pass all tests, adding a CPU / Caffe engine with #2049 and #2442 integrated with the BlobShape message, and keeping the separate layers for the best performance in 2D, etc..)

@rockstone533 commented:
@Yeongtae I trained volume data with input in HDF5 format. When I use matcaffe to parse the caffemodel, I get the error below. Do you know how to solve it?
Check failed: num_kernel_dims == 1 || num_kernel_dims == num_spatial_axes_ kernel_size must be specified once, or once per spatial dimension (kernel_size specified 3 times; 2 spatial dims);

@Yeongtae commented Oct 6, 2015

I just use pycaffe.

@rockstone533 commented:
@Yeongtae What about your input data format? Do you use HDF5?

@Yeongtae commented Oct 6, 2015

Yes. I used it.

@Yeongtae commented Oct 6, 2015

Do you need an example?

@rockstone533 commented:
@Yeongtae Yes, that would be great!

@rockstone533 commented:
Hey @WGW101, I want to know whether your current version supports nD convolution with cuDNN.

@WGW101 (Author) commented Oct 7, 2015

@rockstone533 Hi! Yes, it should if you don't have biases.
If you do, @squall815 suggested a minor modification to make it work:
change cudnnAddTensor to cudnnAddTensor_v3.

Sorry I can't test it myself, for hardware incompatibility reasons...

@rockstone533 commented:
@WGW101 Yeah, I've changed it and my model began to work. However, the speed seems a bit slow. How is your running speed? @squall815

@ToruHironaka commented:
@WGW101, I used this PR but I was not sure about the train-val.prototxt layer settings. Here is what I did so far. I use libcudnn.so.7.0.

  1. I changed the layer types: Convolution --> NdConvolution and Pooling --> NdPooling
  2. I changed the engine from CAFFE to CUDNN
  3. I added kernel_shape like below

pooling_param {
  kernel_shape { dim: 2 dim: 1 dim: 20 dim: 20 dim: 20 }
  pool: MAX
  kernel_size: 3
  stride: 2
}

I have a question here. I thought I had to repeat kernel_size, because I have 3D data of 20x20x20, so I set two more kernel_size values, but I always got the message below:

"Error parsing text-format caffe.NetParameter: 47:16: Non-repeated field "kernel_size" is specified multiple times"

PR #2049 requires repeating kernel_size in order to train on 3D data, but this PR requires setting kernel_shape instead of repeating kernel_size. So, I think I did not get a 3D layer.

I could run training up to Iteration 0, Testing net (#0), but then I got the error below. I think my kernel_shape setting was not correct.

F1208 15:44:10.254520 28197 cudnn_ndconv_layer.cu:43] Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) CUDNN_STATUS_BAD_PARAM

@WGW101 (Author) commented Dec 9, 2015

@ToruHironaka Hi!

First, you shouldn't use kernel_size or stride with the NdPooling layer added by this PR; they are replaced by kernel_shape and stride_shape.
I clearly need to raise an error if both are specified.

In the current implementation, the kernel_size, stride and pad of the master branch are simply ignored; kernel_shape is required, stride_shape defaults to all 1s, and pad_shape defaults to all 0s. This is likely to change to be more adaptive in future versions.

Be careful not to confuse the shape of your kernel with the shape of your input.
From what I understand, here is what your pooling layer should look like in your .prototxt:

layer {
  name: "XXX"
  type: "NdPooling"
  bottom: "yyy" // This is your 2x1x20x20x20 data blob
  top: "xxx" // You'll get a 2x1x9x9x9 output blob

  pooling_param {
    pool: MAX
    kernel_shape { // This is your 3x3x3 kernel
      dim: 3
      dim: 3
      dim: 3
    }
    stride_shape { // And 2x2x2 stride.
      dim: 2
      dim: 2
      dim: 2
    }
  }
}
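
For reference, the 2x1x9x9x9 above follows from floor rounding, which is presumably what cuDNN applies here (Caffe's native 2D pooling rounds up instead): each pooled dimension is floor((in + 2*pad - kernel) / stride) + 1. A small Python sketch under that assumption:

# Pooled output size per spatial axis, assuming floor rounding:
# out = floor((in + 2*pad - kernel) / stride) + 1
def pooled_dim(in_size, kernel, stride, pad=0):
    return (in_size + 2 * pad - kernel) // stride + 1

print pooled_dim(20, 3, 2)  # 9, as in the comment above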

If any errors persist, feel free to ask for help again.

Regards

@ToruHironaka commented:
@WGW101, thanks for your reply, I really appreciate it. I tried the layer definitions below but I got the same problem.

<omitted data layer, I use hdf5 datasets >
layer {
  name: "conv1"
  type: "NdConvolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 96
    kernel_shape {
      dim: 11
      dim: 11
      dim: 11
    }
    stride_shape {
      dim: 4
      dim: 4
      dim: 4
    }
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
    engine: CUDNN
  }
}
layer {
  name: "pool1"
  type: "NdPooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_shape {
      dim: 3
      dim: 3
      dim: 3
    }
    stride_shape {
      dim: 2
      dim: 2
      dim: 2
    }
    engine: CUDNN
  }
}

layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool1"
  top: "ip1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 2
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}

layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip1"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip1"
  bottom: "label"
  top: "loss"
}

Error:

I1209 19:53:55.656716 7984 net.cpp:155] Setting up data
I1209 19:53:55.656774 7984 net.cpp:163] Top shape: 2 1 20 20 20 (16000)
I1209 19:53:55.656800 7984 net.cpp:163] Top shape: 2 (2)
I1209 19:53:55.656813 7984 net.cpp:174] Memory required for data: 64008
I1209 19:53:55.656849 7984 layer_factory.hpp:76] Creating layer conv1
I1209 19:53:55.656950 7984 net.cpp:110] Creating Layer conv1
I1209 19:53:55.656983 7984 net.cpp:477] conv1 <- data
I1209 19:53:55.657047 7984 net.cpp:433] conv1 -> conv1
I1209 19:53:55.876652 7984 net.cpp:155] Setting up conv1
I1209 19:53:55.876729 7984 net.cpp:163] Top shape: 2 96 3 3 3 (5184)
I1209 19:53:55.876742 7984 net.cpp:174] Memory required for data: 84744
I1209 19:53:55.876843 7984 layer_factory.hpp:76] Creating layer pool1
I1209 19:53:55.876912 7984 net.cpp:110] Creating Layer pool1
I1209 19:53:55.876935 7984 net.cpp:477] pool1 <- conv1
I1209 19:53:55.876974 7984 net.cpp:433] pool1 -> pool1
F1209 19:53:55.877362 7984 cudnn.hpp:87] Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) CUDNN_STATUS_BAD_PARAM
*** Check failure stack trace: ***
@ 0x7f7c5ae20daa (unknown)
@ 0x7f7c5ae20ce4 (unknown)
@ 0x7f7c5ae206e6 (unknown)
@ 0x7f7c5ae23687 (unknown)
@ 0x7f7c5b632fbc caffe::cudnn::setTensorNdDesc<>()
@ 0x7f7c5b63250a caffe::cudnn::setTensorNdDesc<>()
@ 0x7f7c5b63025b caffe::CudnnNdPoolingLayer<>::Reshape()
@ 0x7f7c5b5f9a7d caffe::Layer<>::SetUp()
@ 0x7f7c5b5e6414 caffe::Net<>::Init()
@ 0x7f7c5b5e456d caffe::Net<>::Net()
@ 0x7f7c5b6bcb3f caffe::Solver<>::InitTrainNet()
@ 0x7f7c5b6bc362 caffe::Solver<>::Init()
@ 0x7f7c5b6bbe48 caffe::Solver<>::Solver()
@ 0x41ba11 caffe::SGDSolver<>::SGDSolver()
@ 0x419391 caffe::GetSolver<>()
@ 0x415053 train()
@ 0x417428 main
@ 0x7f7c5a332ec5 (unknown)
@ 0x413fa9 (unknown)
@ (nil) (unknown)
Aborted (core dumped)

I think my pooling layer is causing the above error. My CUDA is 7.0, my cuDNN is v3.0, and I have a Titan X, so my setup should be okay, or I might be missing something such as a path setting. Am I missing something else? I also tried to use "ReLU" with this PR but I could not. Why can't I use layer type "ReLU" with this PR? I could use it with PR #2442. Thanks!

@ToruHironaka commented:
@WGW101, I solved it by referencing @squall815's comment above, but I still have problems with the ReLU layer. Does this PR support Nd-LRN?

Thanks!

@ToruHironaka commented:
@WGW101

I can train on my HDF5 datasets with this PR of Caffe, but my trainings have never converged so far: accuracy = 0.5 or less and loss = 1.7 or above. I think my HDF5 datasets or network settings are wrong. I posted my Python script for creating the HDF5 datasets and my network settings below. Please help me out.

My Python script, which converts image files into HDF5 datasets:

import h5py
import numpy as np
import cv2
from os.path import join, exists
import matplotlib.pyplot as plt


# note: the width/height parameters are unused; the size 256 is hardcoded below
def image2HDF5(inputFile, outputDir, fileType, width, height, channel):

    # initialize the total number of files
    # and the input file list
    numberOfFiles = 0
    inputFileList = []
    hdfFileList = []
    visualize = False

    # open the train or test file list (each line: "path label")
    with open(inputFile, 'r') as inputData:
        for fileName in inputData:
            # this input file list includes label information as well
            inputFileList.append(fileName)
            numberOfFiles = numberOfFiles + 1

    print "A number of files: ", numberOfFiles

    # initialize indices
    index = 0
    fileIndex = 0
    periodNum = 100  # create a new hdf5 file every 100 files

    # open each file from inputFileList one by one and write it
    # into hdf5 data files
    for dataFileName in inputFileList:

        if (fileIndex % periodNum) == 0:

            # open a new hdf5 output file for this periodNum cycle
            outputHDFFile = fileType + "-" + str(fileIndex) + ".h5"
            print "file name: " + outputHDFFile
            outputHDFPath = join(outputDir, outputHDFFile)
            print "hdf5 file: ", outputHDFPath
            fileOut = h5py.File(outputHDFPath, 'w')
            hdfFileList.append(outputHDFPath)

            # set data and label dimensions
            data = fileOut.create_dataset("data", (periodNum, channel, 256, 256), dtype=np.float32)
            label = fileOut.create_dataset("label", (periodNum,), dtype=np.float32)

            # image data matrix: NxCxHxW array, plus one label per image
            imageStack = np.empty((periodNum, channel, 256, 256))
            labelStack = np.empty((periodNum,))
            # reset the in-chunk index at every periodNum cycle
            index = 0

        # parse the file path and label info from the file list, line by line
        dataPathandLabel = dataFileName.split(' ', 1)
        dataFilePath = dataPathandLabel[0]
        dataLabel = dataPathandLabel[1]
        labelNumber = int(dataLabel)

        # load the image:
        if channel == 1:
            img = cv2.imread(dataFilePath, cv2.CV_LOAD_IMAGE_GRAYSCALE)  # grayscale
            print 'grayscale: ', img.shape

            # the (256, 256) image broadcasts into the (1, 256, 256) slot
            imageStack[index, :, :, :] = img

        elif channel == 3:
            img = cv2.imread(dataFilePath, cv2.CV_LOAD_IMAGE_COLOR)  # color

            # check the first 5 image files
            if index < 5 and visualize:
                plt.imshow(img)
                plt.show()

            # HxWxC -> CxHxW, the layout Caffe expects
            # (the original used transpose(2, 1, 0), which swaps H and W)
            img = img.transpose(2, 0, 1)

            imageStack[index, :, :, :] = img

        # assign the label per sample; the original wrote labelStack[...],
        # which overwrote every label in the chunk with the last value
        labelStack[index] = labelNumber

        index = index + 1
        fileIndex = fileIndex + 1

        if (fileIndex % periodNum) == 0:

            # flush the image data and labels to the hdf5 file
            # for this periodNum cycle
            data[...] = imageStack
            label[...] = labelStack

            # close the file for this cycle
            # (note: a trailing partial chunk is never written)
            fileOut.close()
            print 'file close'

    # name the hdf5 dataset list file
    outputHDFListFile = fileType + '.txt'
    outputHDFListPath = join(outputDir, outputHDFListFile)

    if exists(outputHDFListPath):
        outputHDFListFile = fileType + '-list.txt'
        outputHDFListPath = join(outputDir, outputHDFListFile)

    print 'list: ', outputHDFListFile
    print 'Output dir: ', outputHDFListPath

    # write the hdf5 file list
    with open(outputHDFListPath, 'w') as trainOut:
        for hdfFile in hdfFileList:
            print hdfFile
            writeOut = hdfFile + "\n"
            trainOut.write(writeOut)
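
One quick way to sanity-check the generated files before training (the filename below is a hypothetical example of the script's output):

import h5py

with h5py.File('train-0.h5', 'r') as f:
    print f['data'].shape   # expected: (100, channel, 256, 256)
    print f['label'].shape  # expected: (100,)
    print f['label'][:10]   # labels should vary per sample, not all be identical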
