Question about the training parameters #1

Open

zjcs opened this issue Nov 21, 2020 · 4 comments

Comments

zjcs commented Nov 21, 2020

Hello @feymanpriv, thank you very much for your work.

I am reviewing your code, and I would like to ask about some key training parameters in your work:

As claimed in the DELG paper, only 15M steps (about 25 epochs) are needed on the GLDv2-clean train (80%) split, while the max epoch in your config file is 100. Did you finally train for 100 epochs to get your result?

Are there any modifications in your implementation that differ from the original TensorFlow implementation?

@feymanpriv (Owner)

Yes, we trained for 100 epochs. In fact, I have tried the TF implementation; it did not work well, the batch size could only be set to a small value, and the training speed was slow. So I guess some details are different from the author's. We used a cosine LR schedule in the experiment.
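
For reference, a cosine schedule like the one mentioned can be set up with PyTorch's built-in scheduler. This is only a minimal sketch; the model, class count, and hyperparameter values below are placeholders, not the repo's actual code:

```python
from torch import nn, optim

# Placeholder model standing in for the real DELG backbone/head.
model = nn.Linear(2048, 81313)  # e.g. global descriptor -> GLDv2-clean classes (assumed count)

# SGD with a placeholder initial LR; CosineAnnealingLR decays it smoothly
# toward 0 over MAX_EPOCH epochs (100 in the config discussed here).
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... run one training epoch here ...
    scheduler.step()  # step the schedule once per epoch
```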

zjcs (Author) commented Nov 23, 2020

@feymanpriv, thank you for your reply.

Your code is very clear to read; thanks for your work. I still have some questions:

  1. About the learning rate and batch size: as DELG recommends, the hyperparameters are 8 Tesla P100 GPUs with --batch_size=256 and --initial_lr=0.01, or 4 Tesla P100 GPUs with --batch_size=128 and --initial_lr=0.005, while your config file uses 8 GPUs with --batch_size=64 and --initial_lr=0.01; given the implementation in loader.construct_train_loader, shouldn't --batch_size be 256? (See the sketch after this list.)

  2. Can you share more results on the difference between training for 25 epochs and for 100 epochs?

  3. You said the TF implementation does not work well; did you use the recommended parameters, and what was the result after 25 epochs?

  4. What do you mean by "the batch size could only be set to a small value"?
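
On point 1, for context: pycls-style loaders typically treat the configured batch size as the global batch and split it evenly across GPUs, and the base LR is usually scaled linearly with the global batch. A minimal sketch of that convention, assuming hypothetical names and the usual linear-scaling rule (check loader.construct_train_loader for the actual logic):

```python
# Assumed convention: the config batch size is the global batch, split per GPU.
total_batch_size = 256   # hypothetical config value (e.g. TRAIN.BATCH_SIZE)
num_gpus = 8             # hypothetical config value (e.g. NUM_GPUS)

assert total_batch_size % num_gpus == 0, "global batch must divide evenly across GPUs"
per_gpu_batch_size = total_batch_size // num_gpus  # 32 images per GPU here

# Linear scaling rule: scale a reference LR with the global batch size.
base_lr, base_batch = 0.01, 256
scaled_lr = base_lr * total_batch_size / base_batch

print(per_gpu_batch_size, scaled_lr)  # -> 32 0.01
```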

@feymanpriv (Owner)

@zjcs, thank you for your attention.

  1. I directly used batch size 256 and LR 0.1 in my training.

  2. I haven't tried 25 epochs.

  3. When I used TF, I also trained for 100 epochs with the settings the paper mentioned, but the result could not reach the model that the Google team released.

  4. Also, in TF training, a large batch size runs out of memory, while PyTorch does not (see the sketch below for a common workaround).

There still exist some open questions here; it would be helpful if you check the code and training.
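
On point 4: a common way to emulate a large effective batch when memory is the limit is gradient accumulation. This is a generic PyTorch sketch, not code from this repo; the model and shapes are placeholders:

```python
import torch
from torch import nn, optim

# Placeholder model and loss; the real training code lives in the repo.
model = nn.Linear(2048, 1000)
optimizer = optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

accum_steps = 8                       # 8 micro-batches of 32 -> effective batch of 256
optimizer.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(32, 2048)         # placeholder micro-batch of features
    y = torch.randint(0, 1000, (32,)) # placeholder labels
    loss = criterion(model(x), y) / accum_steps  # average across micro-batches
    loss.backward()                   # gradients accumulate in .grad
optimizer.step()                      # single update for the whole effective batch
```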

zjcs (Author) commented Nov 24, 2020

@feymanpriv

Your reply is very helpful to me; thank you very much. I will try it later.

Have a nice day~

feymanpriv reopened this Dec 6, 2020