Some help needed in training #247

Open
jarjuk opened this issue Apr 28, 2020 · 3 comments

Comments

@jarjuk

jarjuk commented Apr 28, 2020

First thank you for this nice and clean implementation!

I have created a small repo, https://github.com/jarjuk/yolov3-tf2-training, documenting how I Dockerized your implementation (marcus2002/yolov3-tf2-training:0) and used it to train on the VOC2012 image set on an Amazon g4dn.xlarge instance.

Training was interrupted twice by "early stopping", and the result performed worse than the original darknet weights.

Could you please give me some advice on how to achieve better training results?

I have tried to document all the steps in Emacs Org documents (docker.org and aws.org) in the repository to make it easier to give advice.

BR, Jukka

@Morgensol

Early stopping is there to keep the model from overfitting on the data. If it stops too early for you, you can set the patience higher, or even comment it out and manually monitor when the validation loss converges.
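
For example, a rough sketch of where that patience could be raised in train.py's fit mode (assuming it builds its callback list with tf.keras.callbacks, as the upstream train.py does; adjust to match your copy):

    # Sketch only: raising the early-stopping patience in fit mode.
    from tensorflow.keras.callbacks import (
        EarlyStopping, ModelCheckpoint, ReduceLROnPlateau)

    callbacks = [
        ReduceLROnPlateau(verbose=1),
        # A larger patience (e.g. 10 epochs) waits longer before stopping
        # when val_loss stops improving; remove this entry entirely to
        # disable early stopping and watch the validation loss yourself.
        EarlyStopping(patience=10, verbose=1),
        ModelCheckpoint('checkpoints/yolov3_train_{epoch}.tf',
                        verbose=1, save_weights_only=True),
    ]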

For the second part, it all depends on the parameters you are using, like the learning rate or which transfer mode you have set, so I think you need to specify that if you need help.

@jarjuk
Author

jarjuk commented Apr 28, 2020

I have trained VOC2012 in two sessions:

  • session 1 (as documented in yolov3-tf2/docs/training.voc, section training/with transfer learning)
  • session 2 (continuing from the last checkpoint of session 1, transfer: fine_tune, mode: fit)

In both sessions the learning rate was left at the default value (1e-3).

The command for session 1:

    python train.py \
        --dataset ./voc.data/voc2012_train.tfrecord \
        --val_dataset ./voc.data/voc2012_val.tfrecord \
        --weights ./voc.data/yolov3-cnv.tf \
        --classes ./data/voc2012.names \
        --num_classes 20 \
        --mode fit \
        --transfer darknet \
        --batch_size 16 \
        --epochs 10 \
        --weights_num_classes 80

And for session 2:

    python train.py \
        --dataset ./voc.data/voc2012_train.tfrecord \
        --val_dataset ./voc.data/voc2012_val.tfrecord \
        --weights ./voc.data/cont_20.tf \
        --classes ./data/voc2012.names \
        --num_classes 20 \
        --mode fit \
        --transfer fine_tune \
        --batch_size 16 \
        --epochs 10 \
        --weights_num_classes 20

The result works, but not as well as the original darknet weights. Refer to https://github.com/jarjuk/yolov3-tf2-training#detection-results for a comparison.

I am a novice in DNNs and would like to understand the best strategy for training a DNN, and yolov3-tf2 in particular.

@Morgensol

Morgensol commented Apr 28, 2020

I notice a couple of things:

  • As far as I'm aware, "fine_tune" cuts off your last layer, which would set you back a significant amount. If you want to transfer-learn like that and resume from a checkpoint, I'd recommend looking at How do you resume training? #38 or separate transfer and freeze logic #154.

  • I also assume that the weights you use on your first run are the official weights trained on ImageNet, and the weights you use on your second run are the transferred weights from the first run.

  • Another thing: you don't have to specify weights_num_classes in the second run if you did the transfer learning correctly.

  • And lastly, you haven't specified the image size; are you sure the images are the size you want? (See the sketch below.)
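
To illustrate the last two points, a rough sketch of what your second run could look like, with the image size made explicit and --weights_num_classes dropped (assuming train.py accepts a --size flag with a 416 default, as the upstream repo does; the paths are the ones from your session 2 command):

    python train.py \
        --dataset ./voc.data/voc2012_train.tfrecord \
        --val_dataset ./voc.data/voc2012_val.tfrecord \
        --weights ./voc.data/cont_20.tf \
        --classes ./data/voc2012.names \
        --num_classes 20 \
        --size 416 \
        --mode fit \
        --transfer fine_tune \
        --batch_size 16 \
        --epochs 10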
