
Generate: Add new decoding strategy "DoLa" in .generate() #29619

Merged — gante merged 33 commits into huggingface:main on Jul 9, 2024

Conversation

@voidism (Contributor) commented on Mar 12, 2024

What does this PR do?

Fixes #29524

We add support for a new decoding strategy, DoLa, proposed in a recent ICLR 2024 paper.
The main revisions are in src/transformers/generation/utils.py and src/transformers/generation/configuration_utils.py.

We also update the documentation and add test code. Run the test with:

CUDA_VISIBLE_DEVICES=0 python examples/pytorch/text-generation/run_generation_dola.py --model_name_or_path huggyllama/llama-7b --model_type llama --dola_layers 'low'
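
For reference, here is a minimal usage sketch of the new interface. The checkpoint, prompt, and argument values are illustrative (mirroring the doc example later in this thread), not a definitive API reference.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("On what date was the Declaration of Independence signed?", return_tensors="pt")
# DoLa decoding, contrasting the final layer with the lower half of the layers
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=50, dola_layers="low", repetition_penalty=1.2)
print(tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)[0])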

Before submitting

Who can review?

@gante is the main contributor to the .generate() function, which this PR focuses on.

@gante gante self-requested a review March 13, 2024 09:44
@gante (Member) left a comment:

@voidism thank you for this cool PR! 🔥

In addition to the interface and user experience comments left below, there is one task missing: tests. We should add two tests:

  1. A very small mixin test, to ensure the interface works on all models as expected. See here for an example.
  2. One (or more) heavy integration test(s), to ensure the method retains its correctness as we add other changes. See here for an example. You can add them to any model you believe is appropriate (a rough sketch of such a test follows below).
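
A minimal sketch of what one such slow integration test could look like. The checkpoint, prompt, and assertion below are illustrative assumptions, not values from this PR; a real test would pin the expected text from a reference run of the official DoLa implementation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.testing_utils import require_torch_gpu, slow


@slow
@require_torch_gpu
def test_dola_decoding_llama_integration():
    checkpoint = "huggyllama/llama-7b"  # assumed checkpoint, for illustration only
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map="auto")

    inputs = tokenizer("On what date was the Declaration of Independence signed?", return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs, do_sample=False, max_new_tokens=32, dola_layers="low", repetition_penalty=1.2
    )
    text = tokenizer.batch_decode(output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)[0]
    # In a real test this would be an exact match against a pinned EXPECTED_TEXT
    assert "1776" in text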

Review threads (now resolved) on examples/pytorch/text-generation/run_generation_dola.py, src/transformers/generation/configuration_utils.py (3), and src/transformers/generation/utils.py (3).
Comment on lines 2047 to 2048
mask = final_logits[0] < -1e3
base_logits[0][mask] = -1e3
@gante (Member):

Can we add a comment about -1e3, for future reference? Why not any other number? It is okay if it is simply a number with which you got good results empirically 🤗

@voidism (Contributor Author):

Line 2047 is removed, as I can get the mask directly from the _relative_top_filter function.
The -1e3 in line 2048 is simply a number that was tested to work empirically; any number that is not -float("Inf") should work as well. I have cleaned up the code and moved it all into the _relative_top_filter() function, where -1e3 is assigned to the base_filter_value variable.
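
For readers following along, here is a rough sketch of the relative-top filtering idea being discussed. The function name, signature, and default values are illustrative assumptions, not the exact code merged in this PR.

import math

import torch


def relative_top_filter_sketch(
    final_logits: torch.FloatTensor,
    base_logits: torch.FloatTensor,
    relative_top: float = 0.1,
    filter_value: float = -float("inf"),
    base_filter_value: float = -1e3,
    min_tokens_to_keep: int = 1,
):
    # Keep only tokens whose final-layer probability is at least `relative_top`
    # times the probability of the most likely token.
    final_log_probs = final_logits.log_softmax(dim=-1)
    sorted_log_probs, _ = torch.sort(final_log_probs, descending=True)
    min_keep_thresh = sorted_log_probs[..., min_tokens_to_keep - 1]
    thresh = torch.minimum(min_keep_thresh, final_log_probs.max(dim=-1).values + math.log(relative_top))
    mask = final_log_probs < thresh.unsqueeze(-1)
    # Filtered positions get -inf in the final-layer scores, but only a large finite
    # value (base_filter_value) in the premature-layer scores, so the contrast
    # final - base stays well defined.
    filtered_final = final_log_probs.masked_fill(mask, filter_value)
    filtered_base = base_logits.log_softmax(dim=-1).masked_fill(mask, base_filter_value)
    return filtered_final, filtered_base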

Two further review threads (now resolved) on src/transformers/generation/utils.py.
@voidism (Contributor Author) commented on Mar 19, 2024

Hi @gante !

Thanks so much for your suggestions! I spent some time adding the code for the test cases and fixed the issues you mentioned.
All the CI checks passed as well. Could you take a look at my latest commits?

Please let me know if you have any other concerns or suggestions for me to fix! I would be happy to address any of the issues you may have! 🤗

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gante (Member) left a comment:

Thank you for iterating 💛

It could almost be merged as is -- the tests just need to be reworked slightly. I've added a few suggestions to further improve the PR while we wait for the green light from a core maintainer 🤗

Review threads (now resolved) on docs/source/en/generation_strategies.md (2), src/transformers/generation/utils.py (3), tests/generation/test_utils.py (3), tests/models/gemma/test_modeling_gemma.py, and tests/models/mixtral/test_modeling_mixtral.py.
@gante gante requested a review from amyeroberts March 20, 2024 19:08
@voidism (Contributor Author) commented on Mar 20, 2024

Hi @gante !

Thanks so much for your great suggestions! I have fixed all the issues you mentioned. Just let me know if you have any other concerns or suggestions!
Thanks for requesting a review from the core maintainer! 🤗

@gante (Member) left a comment:

Happy with the PR 🙌

@voidism (Contributor Author) commented on Mar 21, 2024

Hi @gante !

While waiting for the core maintainer's approval, I found that the validation of parameter ranges in the generation config mainly happens in src/transformers/generation/configuration_utils.py rather than src/transformers/generation/utils.py. Thus, I simply moved the repetition-penalty warning for DoLa generation to configuration_utils.py, so the warning will also only occur once!

However, after I committed the new code, a test case for the XLM model failed, and it seems to have nothing to do with my commit. The failure seems related to #29297.

I tried syncing with upstream, but it didn't solve the issue. Do you know what is causing this failing test case? Sorry for bothering you again!

Some tests failed!

============================= FAILURES SHORT STACK =============================
____________________ XLMModelTest.test_batching_equivalence ____________________

tests/test_modeling_common.py:745: in recursive_check
    self.assertTrue(
E   AssertionError: tensor(False) is not true : Batched and Single row outputs are not equal in XLMForQuestionAnswering for key=end_top_index. Difference=1.


FAILED tests/models/xlm/test_modeling_xlm.py::XLMModelTest::test_batching_equivalence - AssertionError: tensor(False) is not true : Batched and Single row outputs are not equal in XLMForQuestionAnswering for key=end_top_index. Difference=1.

Exited with code exit status 255

@voidism (Contributor Author) commented on Mar 22, 2024

The failing test case was resolved after syncing with upstream! Please ignore my previous comment.
It's now ready to merge!

@voidism (Contributor Author) commented on Mar 25, 2024

Hi @amyeroberts !

This PR is ready to merge after some iterations! Would you be able to review it and give me any suggestions you have?
Thanks a lot for the help! 🤗

@amyeroberts (Collaborator) left a comment:

Hi @voidism, thanks for working on adding this!

A few small comments. The main one is that the dola sampling method is currently far too large and needs to be broken down into smaller chunks.

input_ids, max_new_tokens=64, top_p=None, temperature=1, do_sample=False, dola_layers="low"
)
text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print("Answer here: ", text)
@amyeroberts (Collaborator):

Suggested change
print("Answer here: ", text)

@voidism (Contributor Author):

Fixed!

input_ids, max_new_tokens=20, temperature=0, dola_layers="low", repetition_penalty=1.2
)
text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print("Answer here: ", text)
@amyeroberts (Collaborator):

Suggested change
print("Answer here: ", text)

@voidism (Contributor Author):

Fixed!

@@ -788,3 +789,25 @@ def test_model_7b_4bit(self):
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

self.assertEqual(output_text, EXPECTED_TEXTS)

def test_model_2b_bf16_dola(self):
@amyeroberts (Collaborator):

I'd rather we didn't add an integration test for each of these models for this new generation method, as it's expensive to run. Doing this for each new generation approach isn't scalable.

Rather, it's better to have just one integration test per generation method, which checks the output for a selected model. cc @gante

@gante (Member):

@amyeroberts I'd rather have it tested in a few key models, as we've been doing in the past for other generation methods -- generation tests are prone to false positives (due to argmax/sampling) and false negatives (due to a problem in the model used in a test).

But I understand our testing limitations, leaving the final call to you 🤗

@voidism (Contributor Author):

Temporarily removed the test for gemma! Let me know if you want me to add it back! 🤗

Comment on lines 1246 to 1250
for model_name in [
"wav2vec",
"clvp",
"bark",
]
@amyeroberts (Collaborator):

nit - one line

Suggested change
for model_name in [
"wav2vec",
"clvp",
"bark",
]
for model_name in ["wav2vec", "clvp", "bark"]

@voidism (Contributor Author):

Fixed!

"bark",
]
):
self.skipTest("Skip speech models")
@amyeroberts (Collaborator):

Why?

@voidism (Contributor Author) replied on Mar 25, 2024:

I previously skipped these speech models because they don't have the regular output embeddings needed to perform early exit, and early exit is required for DoLa decoding. However, the real criterion is not that they are speech models; we should simply check the output embeddings to decide whether to skip!

Thus, I changed this part to

if model.get_output_embeddings() is None:
    self.skipTest("DoLa is not supported for models that don't have output embeddings")

streamer.end()

if return_dict_in_generate:
if self.config.is_encoder_decoder:
@amyeroberts (Collaborator):

In the tests it says that this isn't supported by encoder-decoder models

@voidism (Contributor Author):

Removed the part of the code guarded by if self.config.is_encoder_decoder:!

}
generation_kwargs.update({"dola_layers": "low"})
output_dola = model.generate(input_ids, attention_mask=attention_mask, **generation_kwargs)
self._check_outputs(output_dola, input_ids, model.config, use_cache=config.use_cache)
@amyeroberts (Collaborator):

We should be able to write a test which does a single forward pass and checks that the expected logits are selected, i.e. the DoLa method should be decoupled from generate itself, so we can test passing logits to the DoLa method and then check the logit outputs. I believe this is a more general issue with the generation testing, however.

Specifically, this test doesn't really convince me that the implementation is correct (nor do the integration tests, unless they've been generated from the official DoLa implementation), only that it functionally works.

@gante (Member):

We should be able to write a test which does a single forward pass and checks that the expected logits are selected, i.e. the DoLa method should be decoupled from generate itself, so we can test passing logits to the DoLa method and then check the logit outputs. I believe this is a more general issue with the generation testing, however.

100% agreed. However, this is not an issue with the DoLa method, but with the structure of generate. At the moment, each decoding function is a monolith where we can't isolate an iteration of the loop. @zucchini-nlp and I are working to fix this problem, so we can break down (and test) each piece of the core functionality. For instance, you've recently reviewed a PR where the stopping condition of the generation loop was moved into a shared function, which works towards this goal 🤗

What this pattern of (legacy) tests does is catch flagrant API issues and/or model incompatibilities, not detect whether the decoding method matches its original implementation. And that's the extent of what we can do in unit tests until we rework things :)

@amyeroberts What I mean with this comment is that it shouldn't be @voidism's responsibility to break down the _dola_decoding function nor to rework the tests; @voidism is simply following the existing pattern. It is our (mine and @zucchini-nlp's) responsibility to make what you wrote come true -- in fact, it is easier for us to refactor things if they keep the same imperfect pattern.

@voidism (Contributor Author):

Regarding the correctness of DoLa: I am the first author of the DoLa paper, and I have been checking whether the new code in this PR can reproduce the numbers from my paper.

[screenshot: side-by-side table comparing the scores reproduced with this PR (left) against the numbers reported in the paper (right)]

The left-hand side shows the new numbers I obtained with the current version of the code.
The right-hand side is a screenshot from my paper, where the numbers come from the official implementation and the experiments I ran last year.

The original implementation was based on v4.28.1. The numbers changed a little (also for the greedy decoding baseline), which I think is due to the version changes as well as the different machines and GPUs I used. But the same level of improvement can be achieved with the new code in this PR, e.g. ~4% on StrQA with llama-7b.

I can also provide more tests to validate the consistency between this PR and my official DoLa implementation if you think that's needed!

@amyeroberts (Collaborator):

@voidism Thanks for providing these numbers! I think these are good enough to have a reasonable degree of certainty in the application in the absence of being able to fully test at the moment

@voidism (Contributor Author):

I have checked that my latest commit today (based on v4.41.0) can also reproduce the scores here!

# DoLa decoding with contrasting lower part of layers (layers 0,2,...,14)
>>> dola_low_output = model.generate(**inputs, do_sample=False, max_new_tokens=50, dola_layers='low', repetition_penalty=1.2)
>>> tokenizer.batch_decode(dola_low_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
['\nThe Declaration of Independence was signed on July 4, 1776.\nWhat was the date of the signing of the Declaration of Independence?\nThe Declaration of Independence was signed on July 4,']
@amyeroberts (Collaborator):

I don't get it - the outputs are the same?

@gante (Member):

+1, otherwise users won't feel compelled to use the technique

@voidism (Contributor Author):

Agreed! I switched back to showing the output example of dola_layers='high', as suggested by @gante last time, and removed the 'low' outputs here. In this case, the 'high' output is different from the vanilla decoding output, which makes more sense to readers.

- If the model has tied word embeddings, we skip the word embeddings (0-th) layer and start from the 2nd layer, as early exit from the word embeddings would become an identity function.
- Set the `dola_layers` to a list of integers to contrast manually specified layers. For example, setting `dola_layers=[28,30]` will contrast the final layer (the 32nd layer) with the 28th and 30th layers.

The paper suggests contrasting `'high'` layers to improve short-answer tasks like TruthfulQA, and contrasting `'low'` layers to improve all the other long-answer reasoning tasks, such as GSM8K, StrategyQA, FACTOR, and VicunaQA. Applying DoLa to smaller models like GPT-2 is not recommended, as shown by the results in Appendix N of the paper.
@amyeroberts (Collaborator):

It would be good to use a better demo for low here

@voidism (Contributor Author):

Switched back to showing a demo of high! I can also try to find prompts that make the vanilla, low, and high outputs all very different, if you think that's needed!

Comment on lines 446 to 538
- For `N`-layer models with `N <= 40` layers, the layers of `range(0, N // 2, 2)` and `range(N // 2, N, 2)` are used for `'low'` and `'high'` layers, respectively.
- For models with `N > 40` layers, the layers of `range(0, 20, 2)` and `range(N - 20, N, 2)` are used for `'low'` and `'high'` layers, respectively.
@amyeroberts (Collaborator):

hmmm - is this from the paper? It seems pretty arbitrary

@voidism (Contributor Author):

Yes, the layer selection logic is in Appendix F of my paper. For llama-7b we use [0, 16) and [16, 32). For llama-13b/33b/65b we use [0, 20) and [N-20, N), where N = 40/60/80 for 13b/33b/65b. They were selected based on validation-set results. In this PR, I renamed this layer selection to low or high for simplicity.
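
A rough sketch of that selection rule, as described in the docs above. The function name and edge-case handling are illustrative assumptions, not the PR's exact code.

from typing import List, Union


def select_dola_candidate_layers(
    num_hidden_layers: int, dola_layers: Union[str, List[int]], tie_word_embeddings: bool = False
) -> List[int]:
    """Return candidate premature-layer indices to contrast against the final layer."""
    # With tied word embeddings, early exit at layer 0 is an identity mapping,
    # so start the candidate range at layer 2 instead of 0.
    start = 2 if tie_word_embeddings else 0
    if dola_layers == "low":
        end = num_hidden_layers // 2 if num_hidden_layers <= 40 else 20
        return list(range(start, end, 2))
    if dola_layers == "high":
        begin = num_hidden_layers // 2 if num_hidden_layers <= 40 else num_hidden_layers - 20
        return list(range(begin, num_hidden_layers, 2))
    # Otherwise, dola_layers is an explicit list of layer indices chosen by the user.
    return [layer for layer in dola_layers if 0 <= layer < num_hidden_layers]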

@voidism (Contributor Author) commented on Mar 25, 2024

Hi @amyeroberts !

Thanks so much for all of your great suggestions! They were very helpful and improved my code and the test cases!
I have done my best to address all the issues you mentioned above. Let me know if there are any remaining concerns or suggestions so I can address them further! 🤗

@amyeroberts (Collaborator) left a comment:

Thanks for iterating on this! Just a few small suggestions - otherwise looking great!

}
generation_kwargs.update({"dola_layers": "low"})
output_dola = model.generate(input_ids, attention_mask=attention_mask, **generation_kwargs)
self._check_outputs(output_dola, input_ids, model.config, use_cache=config.use_cache)
@amyeroberts (Collaborator):

@voidism Thanks for providing these numbers! I think these are good enough to have a reasonable degree of certainty in the application in the absence of being able to fully test at the moment

The method is based on the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" (https://arxiv.org/abs/2309.03883) in ICLR 2024.

Parameters:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
@amyeroberts (Collaborator):

Slight mismatch between docstring and method signature e.g. do_sample missing

@voidism (Contributor Author):

Fixed the mismatch. Now the docstring and method signature are consistent!

>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])

>>> torch.manual_seed(0) # doctest: +IGNORE_RESULT
>>> outputs = model._dola_decoding(
@amyeroberts (Collaborator):

I don't think we should show calling a private method in an example. My understanding from recent refactors is that this is now taken from the generation config. @gante Is this right?

@gante (Member):

Correct. We can remove this example :)

@voidism (Contributor Author):

Removed the example!

# using final layer as the mature layer
mature_layer = self.config.num_hidden_layers
# if the model has tied word embeddings, we skip the word embeddings (0-th) layer and start from the 2nd layer, as the early exit from word embeddings will become identity function
# if the model is really shallow (<=2 layers), we use the 1st layer if it's not the mature layer and the 0-th layer if it's the mature layer. Notice that DoLa is not helping much to shallow models.
@amyeroberts (Collaborator):

ultra nit

Possibly mature -> final to clarify? I'm not sure what a mature layer is i.e. above when it says using the final layer as the mature layer.

Suggested change
# if the model is really shallow (<=2 layers), we use the 1st layer if it's not the mature layer and the 0-th layer if it's the mature layer. Notice that DoLa is not helping much to shallow models.
# if the model is really shallow (<=2 layers), we use the 1st layer if it's not the mature layer and the 0-th layer otherwise. Notice that DoLa does not help shallow models much.

@voidism (Contributor Author):

Changed all the mature layer mentions to final layer!

Comment on lines +2043 to +2485
if return_dict_in_generate:
if output_scores:
scores += (next_token_scores,)
if output_logits:
raw_logits += (final_layer_next_token_logits,)
if output_attentions:
decoder_attentions += (
(outputs.decoder_attentions,) if self.config.is_encoder_decoder else (outputs.attentions,)
)
if self.config.is_encoder_decoder:
cross_attentions += (outputs.cross_attentions,)

if output_hidden_states:
decoder_hidden_states += (
(outputs.decoder_hidden_states,)
if self.config.is_encoder_decoder
else (outputs.hidden_states,)
)
@amyeroberts (Collaborator):

Note for future @gante - this looks like something we can abstract out for this and other generation methods

@gante (Member):

Agreed 👍

Comment on lines 5256 to 5295
else:
# 1. Stacking all premature_layers into a new dimension
stacked_premature_layers = torch.stack(
[candidate_premature_logits[i] for i in candidate_premature_layers], dim=0
)

# 2. Calculate the softmax values for mature_layer and all premature_layers
softmax_mature_layer = F.softmax(final_logits, dim=-1) # shape: (batch_size, vocab_size)
softmax_premature_layers = F.softmax(
stacked_premature_layers, dim=-1
) # shape: (num_premature_layers, batch_size, vocab_size)

# 3. Calculate M, the average distribution
M = 0.5 * (
softmax_mature_layer[None, :, :] + softmax_premature_layers
) # shape: (num_premature_layers, batch_size, vocab_size)

# 4. Calculate log-softmax for the KL divergence
log_softmax_mature_layer = F.log_softmax(final_logits, dim=-1) # shape: (batch_size, vocab_size)
log_softmax_premature_layers = F.log_softmax(
stacked_premature_layers, dim=-1
) # shape: (num_premature_layers, batch_size, vocab_size)

# 5. Calculate the KL divergences and then the JS divergences
kl1 = F.kl_div(log_softmax_mature_layer[None, :, :], M, reduction="none").mean(
-1
) # shape: (num_premature_layers, batch_size)
kl2 = F.kl_div(log_softmax_premature_layers, M, reduction="none").mean(
-1
) # shape: (num_premature_layers, batch_size)
js_divs = 0.5 * (kl1 + kl2) # shape: (num_premature_layers, batch_size)

# 6. Reduce the batchmean
js_divs = js_divs.mean(-1) # shape: (num_premature_layers,)
premature_layer = candidate_premature_layers[int(js_divs.argmax().cpu().item())]

base_logits = candidate_premature_logits[premature_layer]
final_logits, base_logits = _relative_top_filter(final_logits, base_logits)
logits = final_logits - base_logits
return logits
@amyeroberts (Collaborator):

  • We can just do an early return here, which avoids having the main block of code indented
  • Put comments above the line of code to avoid unnecessary splitting
Suggested change
else:
# 1. Stacking all premature_layers into a new dimension
stacked_premature_layers = torch.stack(
[candidate_premature_logits[i] for i in candidate_premature_layers], dim=0
)
# 2. Calculate the softmax values for mature_layer and all premature_layers
softmax_mature_layer = F.softmax(final_logits, dim=-1) # shape: (batch_size, vocab_size)
softmax_premature_layers = F.softmax(
stacked_premature_layers, dim=-1
) # shape: (num_premature_layers, batch_size, vocab_size)
# 3. Calculate M, the average distribution
M = 0.5 * (
softmax_mature_layer[None, :, :] + softmax_premature_layers
) # shape: (num_premature_layers, batch_size, vocab_size)
# 4. Calculate log-softmax for the KL divergence
log_softmax_mature_layer = F.log_softmax(final_logits, dim=-1) # shape: (batch_size, vocab_size)
log_softmax_premature_layers = F.log_softmax(
stacked_premature_layers, dim=-1
) # shape: (num_premature_layers, batch_size, vocab_size)
# 5. Calculate the KL divergences and then the JS divergences
kl1 = F.kl_div(log_softmax_mature_layer[None, :, :], M, reduction="none").mean(
-1
) # shape: (num_premature_layers, batch_size)
kl2 = F.kl_div(log_softmax_premature_layers, M, reduction="none").mean(
-1
) # shape: (num_premature_layers, batch_size)
js_divs = 0.5 * (kl1 + kl2) # shape: (num_premature_layers, batch_size)
# 6. Reduce the batchmean
js_divs = js_divs.mean(-1) # shape: (num_premature_layers,)
premature_layer = candidate_premature_layers[int(js_divs.argmax().cpu().item())]
base_logits = candidate_premature_logits[premature_layer]
final_logits, base_logits = _relative_top_filter(final_logits, base_logits)
logits = final_logits - base_logits
return logits
return logits
# 1. Stacking all premature_layers into a new dimension
stacked_premature_layers = torch.stack(
[candidate_premature_logits[i] for i in candidate_premature_layers], dim=0
)
# 2. Calculate the softmax values for mature_layer and all premature_layers
# shape: (batch_size, vocab_size)
softmax_mature_layer = F.softmax(final_logits, dim=-1)
# shape: (num_premature_layers, batch_size, vocab_size)
softmax_premature_layers = F.softmax(stacked_premature_layers, dim=-1)
# 3. Calculate M, the average distribution
# shape: (num_premature_layers, batch_size, vocab_size)
M = 0.5 * (softmax_mature_layer[None, :, :] + softmax_premature_layers)
# 4. Calculate log-softmax for the KL divergence
# shape: (batch_size, vocab_size)
log_softmax_mature_layer = F.log_softmax(final_logits, dim=-1)
# shape: (num_premature_layers, batch_size, vocab_size)
log_softmax_premature_layers = F.log_softmax(stacked_premature_layers, dim=-1)
# 5. Calculate the KL divergences and then the JS divergences
# shape: (num_premature_layers, batch_size)
kl1 = F.kl_div(log_softmax_mature_layer[None, :, :], M, reduction="none").mean(-1)
# shape: (num_premature_layers, batch_size)
kl2 = F.kl_div(log_softmax_premature_layers, M, reduction="none").mean(-1)
js_divs = 0.5 * (kl1 + kl2) # shape: (num_premature_layers, batch_size)
# 6. Reduce the batchmean
js_divs = js_divs.mean(-1) # shape: (num_premature_layers,)
premature_layer = candidate_premature_layers[int(js_divs.argmax().cpu().item())]
base_logits = candidate_premature_logits[premature_layer]
final_logits, base_logits = _relative_top_filter(final_logits, base_logits)
logits = final_logits - base_logits
return logits

@voidism (Contributor Author):

Fixed!

Comment on lines 5268 to 5269
# 3. Calculate M, the average distribution
M = 0.5 * (
@amyeroberts (Collaborator):

As a rule, no single-letter variables should be used - let's use something more descriptive, e.g. avg_dist

@voidism (Contributor Author):

Changed it to avg_dist!

return logits


def _relative_top_filter(
@amyeroberts (Collaborator):

Definitions of objects should go above the lines where they're first used

@voidism (Contributor Author):

Fixed the order!

base_filter_value=-1e-3,
min_tokens_to_keep: int = 1,
) -> torch.FloatTensor:
"""Reference: https://github.com/XiangLi1999/ContrastiveDecoding/blob/170e9142e92159c1237d731e240f5eb14aabf428/transformers/src/transformers/generation_logits_process.py#L235"""
@amyeroberts (Collaborator):

Link is great! We should add a short sentence saying what this function does too

@voidism (Contributor Author):

Added a description!

Comment on lines +1251 to +1286
if not hasattr(config, "use_cache"):
config.use_cache = False
else:
config.use_cache = True
@amyeroberts (Collaborator):

Based on https://github.com/huggingface/transformers/pull/29619/files#r1538243054

Suggested change
if not hasattr(config, "use_cache"):
config.use_cache = False
else:
config.use_cache = True
# Some models don't support the cache and returning past_key_values
if not hasattr(config, "use_cache"):
config.use_cache = False
else:
config.use_cache = True

@voidism (Contributor Author):

Added the comment!


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@gante (Member) commented on Apr 22, 2024

@voidism are you intending to continue the PR? 🤗 Or do you need a hand?

@voidism (Contributor Author) commented on Apr 22, 2024

Hi @gante

Sorry, I was busy with my midterms for the past few weeks 😔 so this slipped my mind for a while... I will continue fixing the PR this week or next!
Thanks for the reminder, and sorry for the delay!

@gante (Member) commented on Apr 23, 2024

@voidism no worries, focus on your midterms 💪 we'll be here when you're ready to continue 🙌


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@voidism (Contributor Author) commented on May 19, 2024

Hi @gante and @amyeroberts

I am back and have addressed all the suggestions from @amyeroberts last time!

Sorry, I was busy with midterm exams and paper deadlines last month 😔, so I paused work on this PR for a while. 🥲
And last week I traveled to ICLR 2024 to present the DoLa paper! Now I finally have some free time to finish this.

It's my fault that you might need to spend extra time recalling our discussions from almost two months ago. I am really sorry about that! 🥲

In addition to addressing all the suggestions from last time, I have synced this PR with the latest transformers v4.41.0, and it passes all the CI tests. I found that the new version of generation/utils.py has become more concise and cleaner than before. Thanks so much for your efforts in making it better!

Let me know if you have any other concerns or suggestions. I have more free time now, so I can assure you that I will address any new suggestions as soon as I can. No more procrastination, I promise!

@amyeroberts (Collaborator) left a comment:

Thanks for adding and iterating!

All looks good to me. As it's been open for a while, I'd like a quick re-review from @gante to confirm this is still in line with the current generate patterns.

@voidism (Contributor Author) commented on May 29, 2024

Thanks @amyeroberts so much for approving the changes! 🙌

Hi @gante, just let me know whether the current version looks good. I will be happy to address any suggestions or concerns you have! Thanks! 🤗


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@gante (Member) commented on Jun 22, 2024

@voidism my turn to apologise for the delay, I'm catching up with issues :)

I've re-checked the PR and I'm happy with it! I'm going to merge it this Monday (to avoid breaking our CI on a weekend 😉)

@voidism (Contributor Author) commented on Jun 22, 2024

Hi @gante

No problem! Thanks so much for your help!! 🤗

@gante (Member) commented on Jul 9, 2024

Rebased yet again (the previous main had unrelated issues that were making CI red); fixing the resulting issues in the next commits.

@gante (Member) commented on Jul 9, 2024

Ran the following slow tests locally (with the expected results):

  1. RUN_SLOW=1 py.test tests/models/ -k dola -vv
  2. RUN_SLOW=1 py.test -vv tests/models/llama/test_modeling_llama.py
  3. RUN_SLOW=1 py.test -vv tests/generation/test_utils.py
  4. RUN_SLOW=1 py.test -vv tests/utils/test_cache_utils.py

@gante gante merged commit d094d8d into huggingface:main Jul 9, 2024
23 checks passed
@gante (Member) commented on Jul 9, 2024

@voidism finally all CI issues were sorted -- thank you for bearing with us 🤗 I will communicate about this feature tomorrow! 💪

@voidism (Contributor Author) commented on Jul 9, 2024

Hi @gante

Thanks a lot for your help! Handling these CI tests isn't easy (I learned a lot from it 😂). I really appreciate your effort. So happy that we finally made it! 🤗

@gante (Member) commented on Jul 10, 2024

Handling these CI tests isn't easy

@voidism hehe it looks annoying, but it is essential to ensure all our features are playing nicely with each other 🤗
