Refactor generation sampling parameters (e.g. top k, temperature) into "Sampling" classes #5420

Closed · wants to merge 3 commits

Conversation

turtlesoupy

#4164 has a full description of the intention here. Basically, to avoid exploding generate(...) with more arguments, I've added one generic Sampler parameter that allows for arbitrary transformations of the generation probability distribution conditioned on the past. This allows users to specify custom ways of sampling (e.g. insert a specific token after a previous one, etc.)

In the process, I've added some basic tests around these samplers; existing tests pass otherwise.
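To make the idea concrete, here is a minimal sketch of the kind of interface this enables. The class and method names below are illustrative, not necessarily the ones used in the PR:

```python
import torch


class Sampler:
    """Illustrative base class: arbitrarily transforms the next-token logits,
    optionally conditioned on everything generated so far (input_ids)."""

    def warp(self, input_ids: torch.LongTensor, logits: torch.FloatTensor) -> torch.FloatTensor:
        raise NotImplementedError


class TemperatureSampler(Sampler):
    """Example transformation: rescale logits by a temperature before sampling."""

    def __init__(self, temperature: float):
        self.temperature = temperature

    def warp(self, input_ids, logits):
        return logits / self.temperature
```

A call along the lines of model.generate(..., sampler=TemperatureSampler(0.7)) would then apply the transformation once per decoding step; custom rules (e.g. forcing a specific token after a previous one) would be implemented the same way.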

Contributor

@sshleifer left a comment


This looks like a huge improvement from a code readability and extensibility perspective! My only concern is performance.

This CI failure suggests that generation is slowed down.
The failing test is checking (very indirectly) how long a very small bart variant took to .generate on small batches.

From an accuracy perspective, we have some slow integration tests to make sure generation quality doesn't regress.

(These can be prefixed by USE_CUDA=1 if you are on GPU/want them to run faster.)

You should do one run of all the @slow tests using

RUN_SLOW=1 pytest tests/

The ones most likely to break are

RUN_SLOW=1 pytest tests/test_modeling_bart.py
RUN_SLOW=1 pytest tests/test_modeling_t5.py
RUN_SLOW=1 pytest tests/test_modeling_marian.py

@turtlesoupy (Author)

@sshleifer thanks for taking a look. The run against the tests you mentioned (bart/t5/marian) passed when I gave them a kick. On performance: this approach should require the same amount of compute as before, since it is mostly just moving code around (each enabled Sampler runs exactly once per generation-loop step), unless I missed something. Let me do a rebase and see if that CI failure goes away -- let me know if you have any other concerns!
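For context on the compute question, here is a minimal sketch of the decoding hot path under this design (an illustrative sample_loop helper using the Sampler interface sketched above, not the exact PR code). The only per-step additions are the warp calls, which are cheap tensor ops next to the model forward pass:

```python
import torch


def sample_loop(model, input_ids, samplers, max_length):
    # Illustrative sampling loop: one model forward per step, plus one warp
    # call per enabled sampler. The samplers only reshape the distribution;
    # they do not add extra forward passes.
    # Assumes an HF-style model whose output exposes a .logits attribute.
    cur_len = input_ids.shape[-1]
    while cur_len < max_length:
        logits = model(input_ids).logits[:, -1, :]   # next-token logits
        for sampler in samplers:
            logits = sampler.warp(input_ids, logits)
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        cur_len += 1
    return input_ids
```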

Review comment on this diff hunk:

batch_size=batch_size,
num_beams=num_beams,
)
if sampler:
Contributor

I think we will not be able to keep backwards compatibility with beam_search + sampling here, because top_k_top_p_filtering is applied after the beam scores are added. From a logical point of view I think it does make more sense to apply top_k_top_p_filtering after adding the beam scores. On the other hand, beam search sampling is not used that much and is definitely an edge case...

@turtlesoupy (Author), Jul 4, 2020

IIUC the proposal would be: get the raw logits, normalize, add the beam scores, and then perform sampling on the transformed distribution? That makes sense to me; it seems like a design decision how these probability shifts should interact with beam search. Is it covered in any literature?

Contributor

Yeah, I would think the distribution should be transformed after the beam scores have been added. I don't know of any literature on this, though. I'm not too concerned about beam search + sampling, but I'm not sure whether we would also be restricting "greedy" beam search this way for future use cases. @yjernite @srush - do you maybe have more insight here?

@patrickvonplaten (Contributor) commented on Jul 4, 2020

@turtlesoupy - thanks a lot for the PR! Cool design choice!

The generate method definitely needs a bigger refactor sooner or later, and this is a cool idea for making it easier to add new probability-distribution wrapping functions. With this design I'm a bit worried that we restrict beam search too much, in the sense that only the log_softmax of the "next_tokens" distribution can be "wrapped", but not the summed distribution of next_token_scores + beam_scores. This will break the beam search + sampling case here (if I understood the code correctly).

I guess a method that adapts the _beam_scores + next_token_scores could also be used in "greedy" beam search in the future, and this design choice would block us a bit. But I'm not sure whether there are many use cases where one would want to adapt _beam_scores + next_token_scores before applying top_k in "greedy" beam search... what are your thoughts on this? @turtlesoupy @yjernite @sshleifer

@turtlesoupy (Author)

@patrickvonplaten I'm un-opinionated here, since my use cases weren't using beam search; the goal of this PR was to let me introduce my own sampler that enforces rules without having to fork the generate function.

For beam search, one approach could be to apply the warp to (next_token_scores + beam_scores) and then perform sampling afterwards. Then it would be sampling from a consistent space, and the hypothesis scores would be modified appropriately.
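A rough sketch of that ordering, with illustrative names and shapes (not code from the PR): the beam scores are added first, the samplers then warp the combined scores, and sampling happens in that consistent space.

```python
import torch


def warp_then_sample(next_token_scores, beam_scores, samplers, input_ids, num_samples=2):
    # next_token_scores: (num_beams, vocab_size) log-probs for the next token
    # beam_scores:       (num_beams,) accumulated log-probs of each hypothesis
    scores = next_token_scores + beam_scores[:, None]    # combine first
    for sampler in samplers:
        scores = sampler.warp(input_ids, scores)          # warp the combined scores
    probs = torch.softmax(scores, dim=-1)
    next_tokens = torch.multinomial(probs, num_samples=num_samples)
    next_scores = torch.gather(scores, -1, next_tokens)   # warped scores of the sampled tokens
    return next_tokens, next_scores
```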

stale bot commented on Sep 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
