
Quickstart example runs for a long time #998

Open

cspink opened this issue Mar 5, 2023 · 5 comments

cspink commented Mar 5, 2023

I am new to OpenNMT-tf; its features look very useful for what I am trying to do. I began with the quickstart guide found here. Installing the software on an HPC cluster, with its idiosyncrasies, can be cumbersome, and I am not entirely sure I have done everything correctly. The system does run, but even the quickstart example, using the exact commands from the guide on a machine with 4 GPUs, runs for hours, possibly days, at a throughput of ~50k tokens/s. The guide also says the dataset is too small to get good results. So what will training times be on real data?

Since this is described as a toy example, I found it odd that it might run for days on such a large system. I would just like to know what to expect from the autoconfig system described in the guide. I also have no idea when training will stop, or whether it uses a convergence criterion or a fixed number of epochs.

The vocabulary I get out is:

```
24998 toy-ende/src-vocab.txt
35819 toy-ende/tgt-vocab.txt
```

guillaumekln (Contributor) commented

By default, the training will run for 500,000 steps.

For the quickstart, this value is indeed far too high, and we don't expect users to run the full training (we should probably add a note!). The quickstart is mainly meant to showcase a minimal configuration and the main command lines.

When training on real data, this default value is more reasonable, and training can indeed take days. There are options to stop the training automatically at an earlier point, but they are not enabled by default.
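For concreteness, here is a sketch of what limiting the step count and enabling early stopping could look like in the quickstart's `data.yml`. The parameter names (`max_step`, `eval.early_stopping`) reflect my reading of the OpenNMT-tf options, so please verify them against the training documentation:

```yaml
train:
  # Stop unconditionally after this many steps (the default is 500000)
  max_step: 50000

eval:
  # Evaluate every N steps so early stopping has data points to compare
  steps: 5000
  early_stopping:
    # Stop when the metric improves by less than min_improvement
    # over the last `steps` evaluations
    metric: loss
    min_improvement: 0.01
    steps: 4
```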

cspink (Author) commented Mar 6, 2023

Thanks for the quick reply. Yes, I think a note about expectations would be helpful; toy examples are usually quick to run and evaluate. Perhaps also include some expectations about the output, so you know what you should be getting if OpenNMT is running correctly.

Presumably, one can abort training at any time and just evaluate any of the saved checkpoints?

guillaumekln (Contributor) commented

> Presumably, one can abort training at any time and just evaluate any of the saved checkpoints?

Yes.
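To illustrate, evaluating an intermediate checkpoint could look something like the command below, following the quickstart's `infer` invocation. The `--checkpoint_path` flag and the checkpoint path shown are my assumptions about the `onmt-main` CLI, so double-check them against the documentation:

```
# Translate the test set using a specific saved checkpoint
# (the run directory and checkpoint number are illustrative)
onmt-main --config data.yml --auto_config \
  --checkpoint_path toy-ende/run/ckpt-70000 \
  infer --features_file src-test.txt --predictions_file pred-ckpt-70k.txt
```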

cspink (Author) commented Mar 6, 2023

Just to show what I mean by expected output: what I am getting after 70k training steps is so garbled that I am worried something in the software installation has gone wrong. Perhaps ballparking BLEU scores would be an option?

```
$ head -n 5 src-test.txt tgt-test.txt pred-ckpt-70k.txt
==> src-test.txt <==
Orlando Bloom and Miranda Kerr still love each other
Actors Orlando Bloom and Model Miranda Kerr want to go their separate ways .
However , in an interview , Bloom has said that he and Kerr still love each other .
Miranda Kerr and Orlando Bloom are parents to two-year-old Flynn .
Actor Orlando Bloom announced his separation from his wife , supermodel Miranda Kerr .

==> tgt-test.txt <==
Orlando Bloom und Miranda Kerr lieben sich noch immer
Schauspieler Orlando Bloom und Model Miranda Kerr wollen künftig getrennte Wege gehen .
In einem Interview sagte Bloom jedoch , dass er und Kerr sich noch immer lieben .
Miranda Kerr und Orlando Bloom sind Eltern des zweijährigen Flynn .
Schauspieler Orlando Bloom hat sich zur Trennung von seiner Frau , Topmodel Miranda Kerr , geäußert .

==> pred-ckpt-70k.txt <==
und in der , die und alle anderen Techniken verwenden , um jede und , dass Sie auf , und sind , um jede Präfektur und alle in anderen und und und sind noch für sind .
Die Ramblas und des Seehafens von Barcelona .
Das , dass wir in , ist und dass er in und , ohne dass sie erobern müssen .
Die , und , dass sich mit den Eltern schlafen nicht Unterkunft kostenlos !
Die , dass Sie ist , das , , , , , , dass er am 1.1.1993 .
```

guillaumekln (Contributor) commented

The dataset used in the quickstart is too small to get anything useful from a Transformer training. We would need to use a bigger training set in order to define an expected BLEU score.
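As a rough sanity check while no official number exists, one can compute an approximate sentence-level BLEU between a prediction and its reference. The sketch below is a simplified, self-contained implementation (uniform 4-gram weights, simple brevity penalty) and is no substitute for a standard tool such as sacrebleu:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Approximate sentence-level BLEU on whitespace-tokenized strings.

    Uses uniform weights over 1..max_n-gram precisions and a simple
    brevity penalty; returns 0.0 if any precision is zero.
    """
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # Clipped overlap: each hypothesis n-gram counts at most as
        # often as it appears in the reference
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * geo_mean


if __name__ == "__main__":
    ref = "Orlando Bloom und Miranda Kerr lieben sich noch immer"
    print(bleu(ref, ref))  # identical sentences score 1.0
```

A score near zero for most test sentences at 70k steps on the toy data would be consistent with the garbled output above rather than with a broken installation.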
