The accuracy of evaluation cannot increase when training imagenet #59
A Test score of 0.001 means that the network is guessing at random, i.e. it is not learning anything. Shuffling the training data is important; otherwise each batch may contain only images of the same class. In general the Test score should be above 0.01 after roughly the first 5,000 iterations. So if the loss doesn't decrease and the Test score doesn't increase, that is a sign that your network is not learning. You should check your prototxt files and your leveldb.
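For context, 0.001 is exactly chance level for 1000 classes (1/1000), so a Test score stuck there means the predictions carry no information. Below is a minimal sketch of shuffling an ImageNet-style "path label" listing once before building the leveldb; the file names and format are illustrative, not Caffe's actual convert_imageset code.

```cpp
// Illustrative only: shuffle a "path label" listing so that minibatches
// built from it mix classes instead of containing a single class.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <random>
#include <string>
#include <utility>
#include <vector>

int main(int argc, char** argv) {
  if (argc != 3) {
    std::cerr << "usage: shuffle_list in.txt out.txt\n";
    return 1;
  }
  std::ifstream in(argv[1]);
  std::vector<std::pair<std::string, int>> lines;
  std::string path;
  int label;
  while (in >> path >> label) {
    lines.emplace_back(path, label);
  }
  // One-time shuffle with a fixed seed so the result is reproducible.
  std::mt19937 rng(1234);
  std::shuffle(lines.begin(), lines.end(), rng);
  std::ofstream out(argv[2]);
  for (const auto& l : lines) {
    out << l.first << " " << l.second << "\n";
  }
  return 0;
}
```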
Train on the CIFAR dataset to investigate the effect of shuffling versus not shuffling.
Thanks, @sguada and @kloudkl! Sorry that I did not make the question clear. In Yangqing's instructions he did not do the shuffling. What I mean is that I followed everything in his instructions and also shuffled the data when constructing the leveldb. I have already checked the proto files against Alex's paper and am almost sure that the configuration is correct. Now I am checking the code for constructing the leveldb, which I think is the only difference between the "mnist demo" and the "imagenet demo". BTW, could you tell me the size of the ImageNet training data stored in leveldb? In my case it is about 236.2 GB, which is strange, as the original images are only about 60 GB.
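For what it's worth, ~236 GB is roughly what you would expect if the images are stored decoded at 256x256x3 (one byte per channel) rather than as compressed JPEGs, which is why the leveldb ends up much larger than the original files. A rough back-of-the-envelope check (1,281,167 is the standard ILSVRC12 training set size mentioned later in this thread):

```cpp
// Rough sanity check: expected leveldb payload if every image is stored
// as raw 256x256x3 bytes (ignoring keys, labels, and leveldb overhead).
#include <cstdint>
#include <iostream>

int main() {
  const std::uint64_t num_images = 1281167;                       // ILSVRC12 training images
  const std::uint64_t bytes_per_image = 256ULL * 256ULL * 3ULL;   // ~192 KiB each
  const double total_gib =
      static_cast<double>(num_images * bytes_per_image) / (1024.0 * 1024.0 * 1024.0);
  std::cout << "expected raw payload: " << total_gib << " GiB\n";  // ~234.6 GiB
  return 0;
}
```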
@mianba120, your problem may not have been caused by the order of the data. The ImageNet dataset is really huge and not suitable for debugging if you are using all of the images. Successful training can be seen in the comments of #33.

In the current solver loop, the iterations of different epochs are not separated, and in caffe/proto/caffe.proto there is no definition of epochs:

```cpp
template <typename Dtype>
void Solver<Dtype>::Solve(const char* resume_file) {
  ...
  while (iter_++ < param_.max_iter()) {
    ...
  }  // while (iter_++ < param_.max_iter())
}  // void Solver<Dtype>::Solve(const char* resume_file)
```

Therefore, setting max_iter entails computing expected_epochs * iterations_per_epoch, which is a little inconvenient and indeed produced an error in the original imagenet.prototxt (again, see the comments of #33). If max_iter % iterations_per_epoch != 0, I am afraid that the last partial epoch, consisting of max_iter % iterations_per_epoch iterations, would introduce bias into the training dataset. Although the typo has been fixed in commit b31b316, it suggests that a better design would be to set max_epoch and let iterations_per_epoch = ceil(data_size / data_size_per_minibatch). Then, in caffe/solver.cpp, we would have the chance to shuffle the data before each epoch to make the gradients more random and accelerate the optimization process:

```cpp
template <typename Dtype>
void Solver<Dtype>::Solve(const char* resume_file) {
  ...
  while (epoch_++ < param_.max_epoch()) {
    PreEpoch(...);  // Shuffle data and some other stuff
    for (size_t i = 0; i < iterations_per_epoch; ++i) {
      iter_++;
      ...
    }  // for (size_t i = 0; i < iterations_per_epoch; ++i)
    ...
  }  // while (epoch_++ < param_.max_epoch())
}  // void Solver<Dtype>::Solve(const char* resume_file)
```

After this change, it would no longer be necessary to remember to do the shuffling in each example recipe or in any other application.
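As a concrete illustration of the bookkeeping being discussed (the data size is the standard ILSVRC12 training set and the batch size of 256 is the one mentioned later in this thread; the 90-epoch target and the helper function are just a sketch, not part of Caffe):

```cpp
// Sketch: derive max_iter from a desired number of epochs, as described above.
#include <cstdint>
#include <iostream>

std::uint64_t IterationsPerEpoch(std::uint64_t data_size, std::uint64_t batch_size) {
  return (data_size + batch_size - 1) / batch_size;  // ceil(data_size / batch_size)
}

int main() {
  const std::uint64_t data_size = 1281167;  // ILSVRC12 training images
  const std::uint64_t batch_size = 256;
  const std::uint64_t epochs = 90;          // hypothetical target
  const std::uint64_t iters_per_epoch = IterationsPerEpoch(data_size, batch_size);
  std::cout << "iterations per epoch: " << iters_per_epoch << "\n";  // 5005
  std::cout << "max_iter for " << epochs << " epochs: "
            << epochs * iters_per_epoch << "\n";                     // 450450
  return 0;
}
```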
@kloudkl, thanks so much for your help. I will check the discussion in #33. You pointed out: "If max_iter % iterations_per_epoch != 0, I am afraid that the last partial epoch, consisting of max_iter % iterations_per_epoch iterations, would introduce bias into the training dataset." I think some code should also be refined in examples/convert_imageset.cpp, starting from line 82 in int main(int argc, char** argv): as far as I can tell it drops training images 1,281,001 - 1,281,167, since I cannot find any place where the last batch containing those images is written. On your second point, I think it may be better to separate the "epoch loop" and the "batch loop"; however, PreEpoch() may be time-consuming, so I suppose it should be done offline. Also, could you tell me the file size of the ImageNet training data stored by leveldb? I need to make sure my training data is correct. :)
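A minimal sketch of the kind of fix being described, i.e. flushing the final partial batch after the main loop so the tail of the dataset is not silently dropped. This is illustrative code around the standard leveldb WriteBatch API, not the actual convert_imageset.cpp source; the items are assumed to already be (key, serialized value) pairs.

```cpp
// Illustrative sketch: write items in batches of 1000 and, crucially,
// also write whatever remains after the loop, so the last few images
// (e.g. numbers 1,281,001 - 1,281,167) are not lost.
#include <leveldb/db.h>
#include <leveldb/write_batch.h>
#include <string>
#include <utility>
#include <vector>

void WriteAll(leveldb::DB* db,
              const std::vector<std::pair<std::string, std::string>>& items) {
  const int kBatchSize = 1000;
  leveldb::WriteBatch batch;
  int count = 0;
  for (const auto& item : items) {
    batch.Put(item.first, item.second);  // key, serialized datum
    if (++count % kBatchSize == 0) {
      db->Write(leveldb::WriteOptions(), &batch);
      batch.Clear();
    }
  }
  if (count % kBatchSize != 0) {
    // The fix: flush the last, partial batch as well.
    db->Write(leveldb::WriteOptions(), &batch);
  }
}
```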
The convert_imageset.cpp does contain an error that should be fixed, thanks. The ImageNet files stored in the leveldb are 256x256.

IMHO, I am not very keen on the per-epoch random shuffling, mainly because:

(1) Epochs are simply a notion we use to track the progress of training;
(2) Per-epoch random shuffling of a leveldb will really hurt speed, since a leveldb is best read sequentially;
(3) Doing a one-time shuffling and then going sequentially through all the data should be sufficient.

Yangqing
I may have exactly the same problem @mianba120 described. I followed the new version of the imagenet training recipe, with the Caffe package updated on Feb. 10, and I also did the shuffling with convert_imageset. The current log: I am using Ubuntu 12.04 with a K20. The mnist demo works well with my Caffe setup, so I am wondering what the problem is. For the training and validation images, I just converted them to 256x256 in the simplest way. Has anyone succeeded in training imagenet by resizing like this? Should I instead do the same as "The images are reshaped so that the shorter side has length 256, and the centre 256x256 part is cropped for training", as mentioned at http://decaf.berkeleyvision.org/about?
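For reference, a minimal sketch of the "resize the shorter side to 256, then crop the central 256x256" preprocessing that the question refers to, written with OpenCV; this is an illustration of the idea, not the code Caffe's tools actually use.

```cpp
// Sketch: resize so the shorter side is 256 pixels, then take the
// central 256x256 crop, as described on the DeCAF page quoted above.
#include <algorithm>
#include <cmath>
#include <opencv2/opencv.hpp>

cv::Mat ResizeShorterSideAndCenterCrop(const cv::Mat& img, int side = 256) {
  const double scale = static_cast<double>(side) / std::min(img.cols, img.rows);
  cv::Mat resized;
  cv::resize(img, resized,
             cv::Size(static_cast<int>(std::round(img.cols * scale)),
                      static_cast<int>(std::round(img.rows * scale))));
  const int x = (resized.cols - side) / 2;
  const int y = (resized.rows - side) / 2;
  return resized(cv::Rect(x, y, side, side)).clone();
}
```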
Hi @palmforest, this is mianba120 (I have changed my name...). I fixed this problem accidentally, though I have no idea how to give you the correct answer. However, I found the following from my successful case on imagenet:
Overall, I think you may try:
Anyway, if you want, I can send you my imagenet training log. You may also try my branch... (not an advertisement...) Good luck!
Hi @huangjunshi, many thanks for sharing your views and experience. I am using the original Caffe code and have tried to re-run the training 3 times... but the testing scores remain 0.001 after 5000 iterations... I am still trying my luck... :) Could you please share your log on training imagenet? My email address is [email protected]. I am also curious about bad/good random initialization. Theoretically, the random initialization should not lead to a problem like this; has anyone else met this problem with Caffe?
Hi @palmforest, I have met the same problem now, and I think you must have solved it. I would really appreciate it if you could share your wisdom!
Hi @huangjunshi, I have met the same problem now, and I think you must have solved it. I would really appreciate it if you could share your wisdom!
Hi @niuchuang, basically this problem just disappeared after several trials without many modifications, and it never happened again in the last year... The only thing I can remember is that the initial value of the bias was changed to 0.7 (or even 0.5) for all the layers where it was originally 1. Usually, the loss should drop below 6.90 after about 2,000 - 3,000 iterations (batch size 256). Another observation that should be helpful is that the mean gradient of the loss w.r.t. the weights/bias in the FC8 layer should be around 10^-5 - 10^-6 (you may have to write this part yourself). If it is less than 10^-6, such as 10^-7, that usually leads to a bad solution, or the net may not converge at all.
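A small sketch of the kind of monitoring being suggested: the mean absolute value over a gradient buffer. The function below is generic C++; in Caffe you would point it at the FC8 weight/bias gradient data after a backward pass (e.g. the buffer returned by a parameter blob's cpu_diff()), which is an assumption about how you wire it up rather than code from this thread.

```cpp
// Sketch: mean absolute value of a gradient buffer, e.g. the diff of the
// FC8 weights/biases. Values around 1e-5 - 1e-6 were reported as healthy
// in this thread; much smaller values are a warning sign.
#include <cmath>
#include <cstddef>
#include <iostream>

double MeanAbsGradient(const float* diff, std::size_t count) {
  double sum = 0.0;
  for (std::size_t i = 0; i < count; ++i) {
    sum += std::fabs(diff[i]);
  }
  return count > 0 ? sum / static_cast<double>(count) : 0.0;
}

int main() {
  // Toy usage with a dummy buffer; in practice, pass the layer's gradient data.
  const float dummy[4] = {1e-5f, -2e-5f, 3e-6f, -4e-6f};
  std::cout << "mean |grad| = " << MeanAbsGradient(dummy, 4) << "\n";
  return 0;
}
```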
In my case, the first iteration at which the loss went below 6.9 was 4,500; however, it dropped rapidly after that.
Hi Kim, I met the same problem as you described in your mail. Did you finally solve it?
@stevenluzheng Let it run for a few hours. I didn't do anything, but after 2 days it is getting over 50% accuracy now.
Thanks Kim. Actually, I have run 25K iterations since yesterday and the accuracy still remains at 0.39. I use my own dataset to train Caffe, and it seems training fails for some unclear reason. BTW, do you use your own dataset to train Caffe? If so, how many pictures do you use in the training set and validation set respectively?
@stevenluzheng I used the ilsvrc12 dataset, which has 1,281,167 images for training. I heard that it takes about 6 days to reach a sufficient accuracy.
Oh... so you used the ImageNet dataset to train your Caffe; that is a dataset of a different magnitude. I guess a huge dataset can train the network sufficiently, while a small dataset might not make a deep network converge. I only use 300 pictures for training and 40 pictures for validation... Did you ever use your own dataset to train and validate Caffe before?
Hi guys, I have the same problem, but in my case the loss is 8.5177. I'm trying to use the LFW dataset to train my net; for this I wrote my own .prototxt file, following the paper by Guosheng Hu for the architecture. Does anyone have any idea?
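One sanity check worth adding here (my own note, not from the thread): for a softmax classifier over C classes, a network that is guessing at random has an expected loss of about ln(C). For the 1000-class ImageNet setup that is ln(1000) ≈ 6.91, which is exactly the plateau reported above; a loss stuck near ln(C) for your own number of classes means the net has not started learning.

```cpp
// Chance-level softmax loss is roughly -ln(1/C) = ln(C) for C classes.
#include <cmath>
#include <iostream>

int main() {
  std::cout << "1000 classes: " << std::log(1000.0) << "\n";  // ~6.908
  std::cout << "5749 classes: " << std::log(5749.0) << "\n";  // ~8.66 (e.g. all LFW identities)
  return 0;
}
```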
acpn:
Hi, thanks stevenluzheng, but I'm trying to reproduce the results of this paper: http://arxiv.org/abs/1504.02351.
I had a similar problem to the one described here, but in my case the root cause was that I was using labels (in the filenames and the labels .txt file) starting at 1 instead of 0.
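A small illustrative check for that pitfall: the softmax loss expects labels in the range [0, num_classes), so a listing whose smallest label is 1 silently misbehaves. The snippet below scans an ImageNet-style "path label" listing and reports the label range; the checker itself is my own sketch, not a Caffe tool.

```cpp
// Sketch: verify that labels in a "path label" listing start at 0 and
// stay below the expected number of classes.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <limits>
#include <string>

int main(int argc, char** argv) {
  if (argc != 3) {
    std::cerr << "usage: check_labels list.txt num_classes\n";
    return 1;
  }
  std::ifstream in(argv[1]);
  const int num_classes = std::stoi(argv[2]);
  std::string path;
  int label;
  int min_label = std::numeric_limits<int>::max();
  int max_label = std::numeric_limits<int>::min();
  while (in >> path >> label) {
    min_label = std::min(min_label, label);
    max_label = std::max(max_label, label);
  }
  std::cout << "label range: [" << min_label << ", " << max_label << "]\n";
  if (min_label != 0 || max_label >= num_classes) {
    std::cerr << "warning: labels should start at 0 and be < " << num_classes << "\n";
    return 1;
  }
  return 0;
}
```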
The BN implementation from cuDNN has no accuracy problem, at least for cuDNN 5; I trained ResNet-50 to a top-1 accuracy of 75%.
What @huangjunshi said basically did it for me, except that I changed every bias=1 to bias=0.1. But how are we supposed to know this from the very incomplete documentation on Caffe's website? The tutorials are meant for people who know nothing about Caffe and are just getting into deep learning, yet they leave a lot of things out, and this bias issue is their own mistake. Did they even test the tutorials before publishing them?
Hi all,
I have strictly followed Yangqing's instructions on his webpage (http://caffe.berkeleyvision.org/imagenet.html) to train imagenet (including using the .proto files provided by Yangqing), as well as shuffling the training data, yet the accuracy on the evaluation data stays at 0.001 even now, at 77,920 iterations. Here is the current output:
```
I0127 08:53:57.624028 37204 solver.cpp:210] Iteration 78000, lr = 0.01
I0127 08:53:57.633610 37204 solver.cpp:68] Iteration 78000, loss = 6.9063
I0127 08:53:57.633633 37204 solver.cpp:90] Testing net
I0127 08:56:01.357560 37204 solver.cpp:117] Test score # 0: 0.001
I0127 08:56:01.357609 37204 solver.cpp:117] Test score # 1: 6.90977
I0127 08:56:33.533275 37204 solver.cpp:210] Iteration 78020, lr = 0.01
I0127 08:56:33.542655 37204 solver.cpp:68] Iteration 78020, loss = 6.90727
I0127 08:57:05.939363 37204 solver.cpp:210] Iteration 78040, lr = 0.01
I0127 08:57:05.948905 37204 solver.cpp:68] Iteration 78040, loss = 6.9073
```
To find out the reason, I ran the code on mnist and got an almost correct result. I have also sampled some images from both the training dataset and the evaluation dataset; the labels and images are both OK.
Has anyone encountered such a situation before? Can anyone give me some help on how to solve this problem?
My environment is Ubuntu 13.10 with a GTX Titan and CUDA 5.5.