
Have you met memory leak problem when running model? #9

Closed
Sucran opened this issue Oct 8, 2018 · 74 comments

Comments

Sucran commented Oct 8, 2018

Hi @Ugness,
I ran into a RAM memory leak when running network.py and train.py; this has confused me for a few days. Other PyTorch repos I have run are fine.
I run the code on Ubuntu 14.04, PyTorch 0.4.1, CUDA 8.0, cuDNN 6.0.

Ugness (Owner) commented Oct 8, 2018

Yes, I also met the memory leak problem.
In my case my GPU has 11 GB of VRAM, and the model spends about 9 GB of VRAM at batch size 1 (the paper recommends batch size 10, but I could not train the model with that option). That is for the 3*3 convolution version.

Maybe the 1*1 convolution version (branch Adjusted) would spend less memory.

I think the memory blow-up comes from local PiCANet's implementation. It turns an H * W * C tensor into H * W patches of size 14 * 14 * C. If you have a better idea for implementing local PiCANet, please comment here or make a pull request.
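The patch expansion described above can be sketched with `F.unfold`; all sizes here are illustrative assumptions (a 13x13 window is used so the padding stays symmetric), not the repo's actual values:

```python
import torch
import torch.nn.functional as F

N, C, H, W = 1, 4, 8, 8
x = torch.randn(N, C, H, W)

# One 13x13xC patch per pixel position: (N, C*13*13, H*W)
patches = F.unfold(x, kernel_size=13, padding=6)
patches = patches.view(N, C, 13, 13, H * W)
# Memory grows as H*W * 13*13 * C -- the blow-up described above
```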

Since I am not the author of the paper, this code is not the best implementation. I'm sorry for that.

Sucran (Author) commented Oct 8, 2018

Yes, I noticed the batch size option; it is weird. I have no better idea so far, but I hope for further discussion. This week I will go through the author's Caffe code and look at the differences between the PyTorch and Caffe implementations, going deeper into local PiCANet and global PiCANet.

Ugness (Owner) commented Oct 8, 2018

Can you give me the link to the Caffe implementation? I didn't know about it. Thanks.

Sucran (Author) commented Oct 9, 2018

@Ugness https://github.com/nian-liu/PiCANet, the DeepLab Caffe version.

Ugness (Owner) commented Oct 9, 2018

Thanks a lot.

Sucran (Author) commented Oct 9, 2018

@Ugness I changed the PiCANet config from 'GGLLL' to 'GGGGG' and 'LLLLL'; both have the memory leak problem when running network.py. Have you met this before? I also found an interesting part of the authors' Caffe code: they seem to have implemented an att-pooling function in their own proto/cpp, which supports their global or local attention, like your conv3d. Can you give me a hint on how you think about the conv3d processing?

Ugness (Owner) commented Oct 9, 2018

I think it would not work with 'GGGGG' or 'LLLLL'. I have only tested with 'GGLLL'; other options may cause tensor dimension errors.

I will check the proto/cpp ASAP.
My conv3d processing is not easy to describe in text only. :(
I will describe it in text first, but if you need more to help your understanding, I'll make some images ASAP.

Ugness (Owner) commented Oct 9, 2018

How does Conv3d work?

Assumptions

  • Let's say the image batch's shape is (N x C x H x W),
    and each pixel position's attention map has shape (h x w) (for global, h=H, w=W).
  • We have H * W attention maps.
    Each attention map should be applied to each pixel's patch, on every channel.
  • I think doing this with a for loop would take a lot of time
    (I don't know how to write a CUDA-level for loop, and I heard Python's default for loop runs on the CPU),
    so I tried to use PyTorch's pre-implemented convolution functions.

What's the difference between convolution and the 'PiCA' process?

  • Convolution applies the same kernel to each patch (pixel location), but different kernels to each channel and sample (batch element).
  • The PiCA process applies the same att_map to each channel, but a different att_map to each patch (pixel location) and sample (batch element).

PiCA process with Conv3d (main idea of the method)

  • On the image side, the idea is to send the batch and location dimensions to dim 1 (channels):

(1, NxHxW, C, 13, 13) -> for F.conv3d, the dimensions mean (batch, channel, depth, H, W)

  • On the kernel side, each (1, 1, 7, 7) kernel effectively becomes (1, 1, 13, 13) via F.conv3d's dilation option.

  • Then F.conv3d applies NxHxW kernels to NxHxW patches.
    This is possible with the groups option.

  • F.conv3d also slides across the depth dimension (C, dim 2) with the same att_map.

  • Finally, the output is a (1, NxHxW, C, 1, 1) attention-applied feature map, so I can reshape it to (N, C, H, W).

I used the same idea for local PiCANet.
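The grouped-conv3d trick above can be sketched on toy sizes (all numbers and names here are illustrative assumptions; the real model uses larger maps):

```python
import torch
import torch.nn.functional as F

N, C, H, W = 2, 4, 3, 3
G = N * H * W                            # one group per (sample, pixel) pair

patches = torch.randn(G, C, 13, 13)      # a 13x13xC patch around every pixel
att = torch.randn(G, 7, 7)               # a 7x7 attention map per pixel

x = patches.view(1, G, C, 13, 13)        # (batch=1, channels=G, depth=C, 13, 13)
k = att.view(G, 1, 1, 7, 7)              # one kernel per channel group

# dilation=(1,2,2) spreads each 7x7 kernel over 13x13; groups=G pairs the
# i-th attention map with the i-th patch; the depth-1 kernel slides over C
# so every channel is weighted by the same attention map.
out = F.conv3d(x, k, dilation=(1, 2, 2), groups=G)   # -> (1, G, C, 1, 1)
fmap = out.view(N, H, W, C).permute(0, 3, 1, 2)      # back to (N, C, H, W)
```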

Ugness (Owner) commented Oct 9, 2018

  • For conventional Pytorch's Conv3d
    image
  • My use of Conv3d
    image

Ugness (Owner) commented Oct 9, 2018

X_X
I thought Caffe was similar to PyTorch, but it isn't.
I tried to read the code, but I couldn't follow it. The only thing I can see is that they used a for loop.
If they implemented PiCANet with a for loop, a Python for loop without CUDA logic would consume a lot of time. And I don't know how to use a CUDA for loop from Python. T.T

Sucran (Author) commented Oct 10, 2018

@Ugness I do not think they use a loop to implement PiCANet. They use im2col and col2im, which are torch.nn.Unfold and torch.nn.Fold in PyTorch. I suppose conv3d can be translated into a combination of im2col + matrix multiplication + col2im, but I am still confused about how to implement this; still working on it.
The memory problem we suffered seems to be caused by F.conv3d; hopefully the next version will fix it.
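The im2col idea above (unfold into patches, weight each patch by its pixel's attention map, sum) can be sketched like this; the sizes and the random attention maps are stand-in assumptions, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

N, C, H, W = 2, 8, 6, 6
feat = torch.randn(N, C, H, W)
# per-pixel attention over a 13x13 window: (N, H*W, 169)
att = torch.softmax(torch.randn(N, H * W, 13 * 13), dim=-1)

# im2col: one 13x13xC patch per pixel position -> (N, C*169, H*W)
cols = F.unfold(feat, kernel_size=13, padding=6)
cols = cols.view(N, C, 13 * 13, H * W)

# Weighted sum of each patch by its pixel's attention map, shared across channels
out = torch.einsum('npk,nckp->ncp', att, cols).reshape(N, C, H, W)
```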

Ugness (Owner) commented Oct 10, 2018

Thanks. I will also try to convert the conv3d operation into a combination of matrix multiplications.

@Ugness Ugness closed this as completed Oct 10, 2018
@Ugness Ugness reopened this Oct 10, 2018
Ugness (Owner) commented Oct 14, 2018

@Sucran I think I can improve my model soon. There was no function like torch.nn.Fold in PyTorch 0.4.0 when I started this project. Now I have found the function I need. Thanks.

Sucran (Author) commented Oct 14, 2018

Oh, really? Amazing! @Ugness You are such a genius.
Looking forward to your new version. Thanks for your work, again.

Ugness (Owner) commented Oct 14, 2018

Hi @Sucran, I made a new logic!
You can check it at https://github.com/Ugness/PiCANet-Implementation/tree/Fold_Unfold
Now you can train the PiCANet model at batch size 1 using 3.5 GB of VRAM.
I just started my training run, so I'll report the training result next week!

Looks like it works!
image

image

Sucran (Author) commented Oct 15, 2018

@Ugness So happy it works! I checked the Fold_Unfold branch; the memory leak problem seems gone. The VRAM use is also lower when increasing the batch size, but it still cannot reach 10. I will check the channel settings of each layer against the author's Caffe version; maybe some misunderstanding still exists.

Ugness (Owner) commented Oct 15, 2018

@Sucran Thanks a lot for your interest. It gave a lot of improvement. Training speed also seems improved.
About the code version: the Fold_Unfold branch is based on the original (3*3 conv), not the Adjusted (1*1 conv) one.
I am training this code with 3*3 conv, batch_size 4.
I am going to close this issue after reporting the result. If you find errors or need help, please open another issue. :)

Sucran (Author) commented Oct 15, 2018

@Ugness OK. Thanks for your work again. It is my pleasure.

Sucran (Author) commented Oct 19, 2018

@Ugness Anything new?

Ugness (Owner) commented Oct 19, 2018

One of my models got about 88 F-measure on 200 samples of DUTS-TE, where the model in the paper scored 87. So I am now measuring the score on all of DUTS-TE, on all checkpoints, which takes a while.

I can confirm the new model (with a bigger batch size) performs much better.
I think I can update the repo on Sunday or next Monday.

Ugness (Owner) commented Oct 21, 2018

I updated and merged the branch.

Sucran (Author) commented Oct 22, 2018

@Ugness So the result is from the original branch (3*3 conv), not the Adjusted (1*1 conv) one? It seems to exceed the performance of the author's version? Is the curve you plotted from training or validation?

Ugness (Owner) commented Oct 22, 2018

No, it's the Adjusted one; I used 1*1 conv.
Yes, it seems to give better performance. The curve is validation.

I think I need to check all of the code carefully. Maybe there is something wrong.

Sucran (Author) commented Oct 27, 2018

@Ugness Hi, I am trying to reproduce your result, but I am confused about how to compute the metrics you reported. I have a trained weight file, but which code file contains the test code?

Ugness (Owner) commented Oct 28, 2018

You can find the measuring code in pytorch/measure_test.py. It reports the result on TensorBoard, and you can download a CSV from TensorBoard.

Sucran (Author) commented Oct 30, 2018

Hi @Ugness, have you checked your test code for computing max F_b and MAE? I think there are problems here.

  1. The computation of MAE is different from MSE_Loss: MAE is torch.abs(y_pred - y).mean().
  2. I am not familiar with scikit-learn; maybe its pr_curve computation is more efficient than a handcrafted one, but I got a different result. I referenced the code of AceCoooool/DSS-pytorch, and I think the problem may be here.
    Using the trained model 36epo_383000step.ckpt, I got a max F_b of 0.877 with your code but 0.750 with AceCoooool's code.
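The MAE/MSE distinction in point 1 can be checked on a toy tensor (values assumed for illustration):

```python
import torch

y_pred = torch.tensor([0.8, 0.2, 0.6])
y      = torch.tensor([1.0, 0.0, 1.0])

mae = torch.abs(y_pred - y).mean()   # mean absolute error: (0.2 + 0.2 + 0.4) / 3
mse = ((y_pred - y) ** 2).mean()     # what MSE_Loss computes instead
```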

Ugness (Owner) commented Oct 30, 2018

  1. Oops, I found that MSE and MAE are not the same. My mistake; I'll fix it.
  2. I'll check how scikit-learn measures the F-beta score.
    I used a threshold to measure the F-beta score; maybe that was wrong.

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve

For example, if threshold=0.7 and the predicted value is 0.8, I mapped 0.8 to 1, as when making a PR curve.
Then I plotted the max F score over the whole threshold space.
If I had not used a threshold, maybe it would have scored 0.75 like you got.
Thank you for your comment. I also think 0.877 is strange, because my attention maps differ from the author's.

Sucran (Author) commented Oct 31, 2018

@Ugness I do not think the scikit-learn API provides the standard way to compute max F-beta; see Chapter 3.2 of the paper "Salient Object Detection: A Survey". Usually, a fixed threshold is varied from 0 to 255 to binarize the saliency map and compute precision and recall. F-beta is computed from the average precision and average recall over all images, and then we pick the maximum as max F-beta.
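The protocol described above (sweep thresholds, average precision and recall over all images, then take the maximum F-beta) could be sketched like this; the function and parameter names are my own, and beta2=0.3 is the value conventionally used in saliency papers:

```python
import numpy as np

def max_f_beta(preds, masks, beta2=0.3, steps=256):
    """preds/masks: lists of same-shape arrays; preds in [0,1], masks binary (assumed)."""
    thresholds = np.linspace(0.0, 1.0, steps)
    precs = np.zeros(steps)
    recs = np.zeros(steps)
    for pred, mask in zip(preds, masks):
        gt = mask > 0.5
        for i, t in enumerate(thresholds):
            binary = pred >= t                       # binarize at this threshold
            tp = np.logical_and(binary, gt).sum()
            precs[i] += tp / (binary.sum() + 1e-10)
            recs[i] += tp / (gt.sum() + 1e-10)
    precs /= len(preds)                              # average precision per threshold
    recs /= len(preds)                               # average recall per threshold
    f = (1 + beta2) * precs * recs / (beta2 * precs + recs + 1e-10)
    return f.max()                                   # max over the threshold sweep
```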

Ugness (Owner) commented Nov 20, 2018

About your memory problem: how much VRAM and RAM do you have?
And where does the problem occur, RAM or VRAM?

Sucran (Author) commented Nov 20, 2018

@Ugness I think the F-score procedure you showed is correct. It is almost the same as what I reported 7 days ago, right? I set the number of thresholds to 100 and you set it to 256, which should not cause much difference, but was the result still 0.854 when you tested?
The strangest thing to me is the difference in MAE results. I always get 0.65 but you got 0.54, and we test with the same code. Oh man, it is weird!

Ugness (Owner) commented Nov 21, 2018

I also think it is strange. I have a few questions so we can compare our results.

  1. Did you use all of DUTS-TE for testing?
  2. How many images are in your DUTS-TE folder? There were a few mismatched files in DUTS-TE that should be deleted.
  3. Can you give me the threshold value you used for the score 0.7803?
  4. Did you explicitly round (convert to binary) the mask images in DUTS-TE?
  5. Did you use a GPU for testing? (CUDA mode with a single GPU?)

Thanks for sharing your results. It improves this project a lot.

Sucran (Author) commented Nov 21, 2018

  1. All of DUTS-TE.
  2. I found this problem, but those files do not cause a big difference.
  3. No, I did not print that threshold. I will test it again.
  4. The mask images are converted into the range [0,1] automatically.
  5. Yes, a single GPU, CUDA mode.

Ugness (Owner) commented Nov 21, 2018

Thank you for answering.
For 3.: if you want, can you test with my threshold? I already have it; it is 0.6627.

Sucran (Author) commented Nov 23, 2018

@Ugness Sorry, the threshold I tested is 0.8. I have not tested your option yet; I need to wait for an available GPU in my lab.
I also have a question: can you test the MAE on all of DUTS-TE without modifying the dataset? The difference in MAE results confuses me.

Ugness (Owner) commented Nov 24, 2018

What do you mean by "without modifying the dataset"?
I am going to upload all the results (MAE, F-measure, threshold) to Google Drive.
I will also upload the list of image file names in my DUTS-TE dataset with it.

Sucran (Author) commented Nov 28, 2018

@Ugness I mean there should be 5019 images in DUTS-TE without deleting mismatched files; you should test on all 5019 images.

Ugness (Owner) commented Nov 28, 2018

But DUTS-TE-Mask has 2 more images than DUTS-TE-Image.
My DUTS-TE-Image folder has 5019 images. I deleted 2 images from DUTS-TE-Mask because it had 5021 images.

RaoHaobo commented Apr 2, 2019

Hi @Ugness, I integrated your measure.py and train.py, but I did not change network.py. I set the batch size to 2. At the first learning-rate drop my training loss falls, but after that, although the learning rate keeps decaying, the training loss never falls again. Also, I tested my model on PASCAL-S and the best MAE is 0.1243. Could you help me solve this problem?

Ugness (Owner) commented Apr 3, 2019

@RaoHaobo Can you give me some screenshots of your loss graph? You can find it on TensorBoard.
Also, I think it is better to open a new issue. Thanks.

RaoHaobo commented Apr 3, 2019

I changed the learning-rate decay to every 15000 steps, but the training loss still stops falling, the same as with your 7000. My training loss and learning rate are as follows. Thanks!

image

Ugness (Owner) commented Apr 3, 2019

I think that graph looks fine.

But if you think the loss should be lower, I recommend increasing the lr decay rate and the lr decay step. For the hyperparameters in my code, I just followed the PiCANet paper's implementation with the DUTS dataset.
As for MAE, it may be related to batch size. When I changed the batch size from 1 to 10 (maybe 4, I do not remember exactly), the performance increased incrementally.

I'll let you know the specific scores when I find the past results.
I'll also upload the loss graph of my experiment. Thanks.
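The schedule being discussed (multiply the lr by a fixed factor every N steps) maps onto PyTorch's StepLR; the base lr and model below are stand-in assumptions, while step_size=7000 and gamma=0.1 are the values mentioned in this thread:

```python
import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(4, 1)                        # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01)   # base lr is an assumption

# Decay by gamma every 7000 steps; raising gamma or enlarging step_size
# keeps the lr useful for longer, as recommended above.
sched = StepLR(opt, step_size=7000, gamma=0.1)
```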

RaoHaobo commented Apr 4, 2019

I changed the lr decay rate from 0.1 to 0.5, with an lr decay step of 7000. My loss is as follows:
image
image

Why does the training loss stop falling after one epoch? Have you met this problem?

ghost commented Apr 9, 2019

Nice work and nice code!
When I run 'python train.py --dataset ./DUTS-TR', an error occurs (it seems something is wrong with tensorboardX, but I have no idea what to do):
image
Thanks for your reply~

RaoHaobo commented Apr 9, 2019

@dylanqyuan The version of your tensorboardX is too high.

ghost commented Apr 9, 2019

It works! Thank you, buddy!

Ugness (Owner) commented Apr 9, 2019

@RaoHaobo #16 (comment)
I've uploaded my graph at that link, and I suggest you follow that link's 3 steps to check whether the model is trained or not.

My graph also fluctuates like yours, and it does not look like it is decreasing either.
As for your graph, I am concerned about the learning rate. I think it became too small to train the model effectively after 1 epoch. But I have not run such an experiment, so it's just my personal opinion.

If you want to check your model's performance, follow the steps at the link.
If you are worried about the non-decreasing training loss, I suggest you (and I should too) run more experiments with the learning rate and the other hyperparameters.
In detail:

  • make the lr decay rate between 0.9~1, or make the decay step much larger.
  • carefully observe whether the model is trained enough at that lr.

Thank you for your interest, and I hope you reach much higher performance than my experiments! :)

P.S. Please comment at #17 if you want to discuss this issue further, to make it easy to find.

RaoHaobo commented May 5, 2019

@Ugness I tested your '36epo_383000step.ckpt' on PASCAL-S, and the result is
image
but your result is
image
Why?
Another problem: I added some of my own ideas to your code, and my model trains well:
image
but when I test my model with your measure_test.py, the result is
image

RaoHaobo commented May 5, 2019

@Ugness The second problem has been solved; the first is not solved yet.

Ugness (Owner) commented May 7, 2019

Sorry, I forgot to mention that all of my experiment results are on the DUTS dataset only. I have updated my README file.
If you got my numbers from the README.md: I trained and tested the model ONLY on the DUTS dataset, so results on the PASCAL-S dataset may differ.

RaoHaobo commented

@Ugness OK.
image
This is your trained model; I used it to test on PASCAL-S and SOD, and the max_F is 0.8379. Could you test your model on other datasets?

RaoHaobo commented

@Ugness This is the code in your measure_test.py:
image
but github.com/AceCoooool/DSS-pytorch solver.py has
image
I think they are quite different.

Ugness (Owner) commented May 13, 2019

I added that .sum(dim=-1) because my code evaluates several images in parallel.
github.com/AceCoooool/DSS-pytorch's solver.py calculates precision/recall on a single image at a time, while my code calculates all images at once.
The full dimensions of y_temp and mask are (batch, threshold_values, H, W).
If I called a bare .sum(), it would sum all values in y_temp, although we should sum only over the H and W axes.
As for the 1e-10, I added it to avoid division-by-zero problems.
If you think my explanation is wrong, please give me your advice. Thanks.
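The per-image reduction described above can be seen on random tensors; the shapes follow the (batch, threshold_values, H, W) layout mentioned, and everything else is illustrative:

```python
import torch

# Binary prediction/mask stacks: (batch, threshold_values, H, W)
y_temp = (torch.rand(2, 5, 8, 8) > 0.5).float()
mask   = (torch.rand(2, 5, 8, 8) > 0.5).float()

# Sum only over H and W, keeping one value per (image, threshold) pair
tp = (y_temp * mask).sum(dim=-1).sum(dim=-1)     # (batch, thresholds)
pred_pos = y_temp.sum(dim=-1).sum(dim=-1)
prec = (tp + 1e-10) / (pred_pos + 1e-10)         # 1e-10 guards against 0/0

# A bare .sum() would collapse everything to a single scalar instead:
total = (y_temp * mask).sum()                    # 0-dim tensor
```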

RaoHaobo commented

@Ugness I mean the 1e-10 in tp + 1e-10: maybe it should be taken out. I tried removing it, but the max_F fell a lot.
I also used the DSS code to test your model on DUTS-TE, and the result was bad.

Ugness (Owner) commented May 16, 2019

How much difference does that error cause?
Is the difference significant?
Let me know. Thanks.

RaoHaobo commented

When the threshold equals 1, the precision must be 0, but your result equals 1.
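A minimal illustration of the reported edge case, assuming no pixel survives the threshold: both the true-positive count and the predicted-positive count collapse to the epsilon, so the ratio reads as 1 instead of an undefined (or, by convention, 0) precision.

```python
# At a threshold where nothing is predicted positive: tp = pred_pos = 0
tp, pred_pos = 0.0, 0.0
prec = (tp + 1e-10) / (pred_pos + 1e-10)
# prec is exactly 1.0 here, which inflates the PR curve at extreme thresholds
```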

RaoHaobo commented

@Ugness
The writer.add_pr_curve() function in measure.py doesn't work; it never shows in TensorBoard. I think it is caused by the TensorBoard version.
image

Ugness (Owner) commented Aug 1, 2019

https://github.com/tensorflow/tensorboard/releases
Would you try it with TensorBoard 1.8.0?

@Ugness Ugness closed this as completed Jul 22, 2020