Process aborting with "Insufficient data to determine video format" error when fine-tuning #13

Closed
think-high opened this issue Jun 12, 2018 · 9 comments

Comments

@think-high

think-high commented Jun 12, 2018

Hi, I am trying to fine-tune the pre-trained (Kinetics) R2Plus1D model on my dataset. I created the train and test LMDBs for my dataset like this:

#Creating Training LMDB
python /home/rahul/R2Plus1D/data/create_video_db.py --list_file=/home/rahul/Dataset/train_test_lists/train_list.csv --output_file=/home/rahul/Dataset/train_test_lists/LMDB_Training --use_list=1

#Creating testing LMDB
python /home/rahul/R2Plus1D/data/create_video_db.py --list_file=/home/rahul/Dataset/train_test_lists/test_list.csv --output_file=/home/rahul/Dataset/train_test_lists/LMDB_Testing --use_list=1

The format of the csv files is:
org_video("the path of videos"),label("integer label of the video")
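
A list file in this format could be generated with a short script. This is only a minimal sketch, assuming the header row is `org_video,label` as described above; the `write_list_file` helper and the example paths are hypothetical and not part of the R2Plus1D tools.

```python
import csv

def write_list_file(path, samples):
    """Write (video_path, integer_label) pairs as a CSV list file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Header assumed from the format described in the post.
        writer.writerow(["org_video", "label"])
        for video_path, label in samples:
            writer.writerow([video_path, int(label)])

# Hypothetical example entries: (path to the video, integer label).
samples = [
    ("/path/to/videos/clip_a.avi", 0),
    ("/path/to/videos/clip_b.avi", 1),
]
```

Calling `write_list_file("train_list.csv", samples)` would produce a file that can then be passed to `create_video_db.py` via `--list_file`.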

And then I run this to train the model:

python /home/rahul/R2Plus1D/tools/train_net.py \
--train_data=/home/rahul/Dataset/train_test_lists/LMDB_Training \
--test_data=/home/rahul/Dataset/train_test_lists/LMDB_Testing \
--model_name=r2plus1d --model_depth=18 \
--clip_length_rgb=16 --batch_size=4 \
--pretrained_model=/home/rahul/R2Plus1D/pre-trained-models/r2.5d_d18_l16.pkl \
--db_type='pickle' --is_checkpoint=0 \
--gpus=0,1 --base_learning_rate=0.0002 \
--epoch_size=40000 --num_epochs=8 --step_epoch=2 \
--weight_decay=0.005 --num_labels=14

But the process is getting aborted with these errors:

E0612 23:00:22.106964  9409 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.107067  9411 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.106992  9412 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.106918  9407 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.107008  9413 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.107035  9414 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.106915  9408 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469903  9409 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469926  9411 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469921  9413 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469907  9410 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469923  9412 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469954  9414 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469923  9407 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:23.469995  9408 video_decoder.cc:75] Insufficient data to determine video format

..........
..........
.........
.........
.........

 Encountered CUDA error: device-side assert triggered Error from operator:
input: "gpu_0/comp_4_spatbn_1" input: "gpu_0/comp_4_conv_2_middle_w" input: "gpu_0/__m1_shared" output: "gpu_0/comp_4_conv_2_middle_w_grad" output: "gpu_0/__m2_shared" name: "" type: "ConvGradient" arg { name: "no_bias" i: 1 } arg { name: "kernels" ints: 1 ints: 3 ints: 3 } arg { name: "ws_nbytes_limit" i: 67108864 } arg { name: "exhaustive_search" i: 1 } arg { name: "strides" ints: 1 ints: 1 ints: 1 } arg { name: "pads" ints: 0 ints: 1 ints: 1 ints: 0 ints: 1 ints: 1 } arg { name: "order" s: "NCHW" } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN" is_gradient_op: true
E0612 23:00:23.986739  9419 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'SoftmaxWithLoss'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: device-side assert triggered Error from operator:
input: "gpu_1/last_out_L14" input: "gpu_1/label" output: "gpu_1/softmax" output: "gpu_1/loss" name: "" type: "SoftmaxWithLoss" device_option { device_type: 1 cuda_gpu_id: 1 }
F0612 23:00:23.990535  9416 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
*** Check failure stack trace: ***
F0612 23:00:23.990537  9418 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.990561  9420 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.990689  9422 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.990710  9421 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.990717  9417 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.991060  9419 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
*** Check failure stack trace: ***
F0612 23:00:23.990537  9418 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.990561  9420 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.990689  9422 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.990710  9421 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.990717  9417 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
F0612 23:00:23.991060  9419 context_gpu.h:107] Check failed: error == cudaSuccess device-side assert triggered
MyScripts/trainWithPretrainedModel.sh: line 10:  9358 Aborted 

Here is the stdout dump that I redirected to a file:

Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file

Can you please help me out with this?

Thanks,
Rahul Bhojwani

@dutran
Contributor

dutran commented Jun 13, 2018

Looks like you failed at the "SoftmaxWithLoss" layer. In your command you have --num_labels=89, but the layer name in the error is "gpu_1/last_out_L14", which means your prediction layer has only 14 outputs at the softmax while your labels go out of range, e.g. label >= 14. That is why you hit this error. Check carefully why the net still creates the last layer with 14 outputs when you pass --num_labels=89.
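
The condition described here, a label falling outside the range of the prediction layer, can be checked on the list file before the LMDB is built. A minimal sketch, assuming a CSV with a `label` column as in the original post; `find_out_of_range_labels` is a hypothetical helper, not part of the R2Plus1D tools.

```python
import csv

def find_out_of_range_labels(list_file, num_labels):
    """Return (line_number, label) pairs whose label is not in [0, num_labels)."""
    bad = []
    with open(list_file, newline="") as f:
        # Line 1 is the header, so data rows start at line 2.
        for line_number, row in enumerate(csv.DictReader(f), start=2):
            label = int(row["label"])
            if not 0 <= label < num_labels:
                bad.append((line_number, label))
    return bad
```

An empty result for `find_out_of_range_labels(train_list, 14)` would suggest that every label fits a 14-way softmax; any returned pair points at a row that would need fixing.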

@murilovarges

The integer labels of the videos must start from 0. I ran into errors when mine started from 1.

@think-high
Author

@dutran : Actually the --num_labels was set to 14 in the script I ran. I just copied the other script here by mistake. My bad. Will correct it now.

@murilovarges : I actually have my labels starting from 1. Let me try this.

@dutran
Contributor

dutran commented Jun 13, 2018

Then, changing your labels to start from 0 as @murilovarges suggested will solve your issue.
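
One way to make that change is to remap the existing labels to contiguous ids starting at 0. This is a minimal sketch under the assumption that labels are plain integers; `remap_labels` is a hypothetical helper. Note that it also closes any gaps between label values, which goes slightly beyond simply subtracting 1.

```python
def remap_labels(labels):
    """Map integer labels to contiguous ids 0..K-1, preserving their sort order."""
    mapping = {old: new for new, old in enumerate(sorted(set(labels)))}
    remapped = [mapping[label] for label in labels]
    return remapped, mapping
```

For labels that already run from 1 to 14 without gaps, the result is the same as subtracting 1 from each label, and the returned mapping can be kept to translate predictions back to the original ids.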

@think-high
Author

think-high commented Jun 13, 2018

Hey. So, I did that and the process is not getting aborted and the training is running. But these warnings are still constantly being generated:

E0612 23:00:22.106964  9409 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.107067  9411 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.106992  9412 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.106918  9407 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.107008  9413 video_decoder.cc:75] Insufficient data to determine video format
E0612 23:00:22.107035  9414 video_decoder.cc:75] Insufficient data to determine video format

Can you help me understand the reason and whether it is a problem or not?

Thanks

@dutran
Contributor

dutran commented Jun 13, 2018

It looks like your video is missing some metadata, so the decoder does not know how to decode it.

@dutran dutran closed this as completed Jun 13, 2018
@murilovarges

This message comes from https://github.com/pytorch/pytorch/blob/master/caffe2/video/video_decoder.cc#L75.

I guess it can also happen when the video is very small.
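
To look for candidate clips, the list of video paths could be scanned for missing or very small files. This is a rough sketch only: `find_small_videos` is a hypothetical helper, the 1024-byte threshold is an arbitrary assumption, and file size is just a proxy that does not confirm what the decoder will do with a given file.

```python
import os

def find_small_videos(paths, min_bytes=1024):
    """Return the paths that are missing or smaller than min_bytes."""
    suspects = []
    for path in paths:
        # A missing file or a tiny file is flagged for manual inspection.
        if not os.path.isfile(path) or os.path.getsize(path) < min_bytes:
            suspects.append(path)
    return suspects
```

The paths from the `org_video` column of the list file can be passed in directly, and any flagged clip can then be inspected or re-encoded by hand.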

@murilovarges

Similar issue in facebookresearch/video-nonlocal-net#12

@think-high
Author

Oh great. This is very helpful. Thanks a lot @dutran and @murilovarges . 👍
