Any way we can get dropout on full finetune? #672
Comments
The dropout feature in LoRA comes from PEFT upstream. If we want dropout for full finetune, we would need to modify the model architecture ourselves, since it isn't built in; I'm not sure that's a good idea. Alternatively, perhaps a lower LR might work for you, or experimenting with schedulers?
A lower LR can hide the spike, but it doesn't help spread out learning or prevent the over-concentration of functionality in specific neurons and the resulting bottlenecking. And if you use a lower LR throughout the whole training run (we only have limited tools for nuanced LR control over time), you actually get a worse eval loss, because you start hitting epoch boundaries when you're less far along in the training process. Is there an upstream library where it would be better to add dropout? Another option would be L2 regularization or any of the other dropout alternatives, though directly dropping out parts of the network (whether through traditional dropout, DropPath, or whatnot) is AFAIK the most effective approach. I know most people are using axolotl to train LoRAs, but these epoch spikes with finetuning are a big problem.
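(For concreteness, L2-style regularization is already reachable through the optimizer; a minimal PyTorch sketch with a stand-in model, since AdamW's decoupled weight decay plays a similar regularizing role:)

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in for the actual model being finetuned

# AdamW's decoupled weight decay acts as an L2-style regularizer,
# penalizing large weights rather than dropping activations.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.1)
```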
You would need to modify the model architecture in the modeling code itself. For example, llama: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py However, I don't have enough expertise in updating this to add layers.
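(To illustrate the kind of modeling-code change being described, here's a hypothetical monkey-patch sketch that wraps each decoder layer's MLP with dropout. It follows the attribute names in the HF llama modeling code, but it's an untested illustration, not axolotl code:)

```python
import torch.nn as nn

def add_mlp_dropout(model, p=0.1):
    """Hypothetical sketch: wrap each decoder layer's MLP output in dropout.

    Assumes a HF LlamaForCausalLM-style model exposing model.model.layers,
    each with an .mlp submodule. Illustration only, not a tested patch.
    """
    for layer in model.model.layers:
        drop = nn.Dropout(p)
        orig_forward = layer.mlp.forward
        def patched(x, _f=orig_forward, _d=drop):
            return _d(_f(x))
        layer.mlp.forward = patched
    return model
```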
I will close this for now, as an issue has been opened upstream. Please let us know if this needs to be reopened later due to an update.
@enn-nafnlaus btw, I was doing some experimentation with dropout. I don't know if this iteration works out of the box, but it might be a good starting point: main...llama-dropout
A similar one for mistral too.
Didn't know there was an open branch. Reopening.
Ooh, nice! I'm regenerating a larger dataset at the moment, but will try it as soon as my cards are freed up! :)
Just tried both branches; sadly, no luck.

mistral-dropout: ... It keeps going after that error, but I run out of memory; it seems I can't fit it on either a single RTX 3090 or an RTX 3090 + 3060. So, a dead end.

llama-dropout: [2023-10-11 22:07:21,996] [INFO] [axolotl.load_model:176] [PID:2617024] [RANK:0] patching _expand_mask

This is with model winglian/llama-2-4b. Obviously we don't have support for falcon at all, let alone dropout, so I can't try that.
Any progress on this? :)
Found a small enough mistral model that I can actually try it (hongyin/mistral-0.5b-40k). Unfortunately, it's a terrible model compared to, say, PY007/TinyLlama-1.1B-intermediate-step-480k-1T.

I retested both dropout branches. llama-dropout is still broken in the same manner. I tried the mistral-dropout branch; weirdly, it runs out of VRAM, which does not happen on the main branch. I could try reducing e.g. batch size, but I don't see why its VRAM footprint should differ from mainline...

Going to try an alternative to dropout to deal with the loss spikes at end-of-epoch: I wrote a script to randomly tweak the input data with synonyms, antonyms, hypernyms, and hyponyms, as well as various minor text-formatting changes, to be able to multiply out the input data and thus hopefully reduce the model's ability to memorize the training data when running multiple epochs. Dunno if it'll work, but it's a stopgap...

(Unrelated to normal dropout, but it occurred to me that Learning Rate Dropout would be a cool feature. I don't know if it's been mainlined in PyTorch, but in theory it should allow faster learning with a smaller memory footprint by having only a random fraction of the nodes involved in backpropagation, with the others just running inference and gradient accumulation.)
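(For the curious, the augmentation idea is roughly the following WordNet-based swap - a simplified sketch, not the actual script:)

```python
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet") first

def swap_word(word):
    """Replace a word with a random synonym, hypernym, or hyponym, if any."""
    synsets = wordnet.synsets(word)
    if not synsets:
        return word
    syn = random.choice(synsets)
    # Synonyms from the chosen sense, plus lemmas of hypernyms/hyponyms.
    candidates = [lemma.name() for lemma in syn.lemmas()]
    for related in syn.hypernyms() + syn.hyponyms():
        candidates.extend(lemma.name() for lemma in related.lemmas())
    return random.choice(candidates).replace("_", " ")
```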
Hey, dropout just got added upstream! huggingface/transformers#27315 Hopefully we can use that in axolotl soon!
Cool @enn-nafnlaus! Seems like you can do this manually by adding the config. Alternatively, if you would like to PR it, it's a matter of adding a new param to the yaml and setting it like this rope sample for llama:
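(If doing it manually outside axolotl, a minimal sketch of the config route - the model name is just an example, and `attention_dropout` assumes a transformers build that includes the PR above:)

```python
from transformers import AutoConfig, AutoModelForCausalLM

name = "mistralai/Mistral-7B-v0.1"  # example model only
config = AutoConfig.from_pretrained(name)
config.attention_dropout = 0.05  # key added upstream in transformers #27315
model = AutoModelForCausalLM.from_pretrained(name, config=config)
```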
Will try adding attention_dropout to my yaml this evening. Any way to know whether it's actually being used, apart from a difference being visible in the training outputs?
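(One rough heuristic I can think of, sketched below and not an axolotl feature: in train mode, two forward passes over the same batch should differ if dropout is actually firing, assuming no other stochastic layers:)

```python
import torch

def dropout_seems_active(model, input_ids):
    """Heuristic check: outputs of two train-mode passes differ under dropout."""
    model.train()
    with torch.no_grad():
        a = model(input_ids).logits
        b = model(input_ids).logits
    return not torch.allclose(a, b)
```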
I've noticed the train loss can get pretty high, even with a 0.05 dropout rate.
Trying it out this evening. I don't have a config.json file. My models.py (just did a git pull this evening) doesn't look like the one you link - the closest equivalent to the code you pointed at is:
I tried hacking something to have an equivalent impact - hopefully it works.
Kinda awkward that I can't see if it's being used or not, but I set dropout to 0.2, so hopefully there will be an obvious impact... (I've given up on doing GitHub PRs... the GitHub and approval side is 10 times more effort than the actual code changes.)
Edit: Nope, not seeing a difference between 0.2 and 0.05... I doubt it's being used. Also not seeing my debug output show up.
Hey @enn-nafnlaus, a PR has just been merged to facilitate this. You can just pass the following in your yaml:

```yaml
model_config:
  attention_dropout: 0.01
```

Please let us know how it goes.
I can now verify that it does indeed affect training :) Before I do a serious training run to evaluate its impact on preventing end-of-epoch eval spikes / overfitting to the training data, I need to create a new training baseline, as my base model (TinyLlama) just had a new release. Will update once I have a good answer.
While I still plan to do more test runs with different LRs, LR schedules, and weight decays, I'm prepared to say that this feature is now: A) implemented, and B) effective.

Here's 25% dropout (purple) vs. no dropout (orange), both at 0.1 weight decay. The no-dropout (orange) case uses what I previously determined to be an optimal LR and schedule (inv_sqrt, lr=0.00003 - it has to get as much learning done as possible before the epoch boundary while also having a greatly reduced LR afterwards to limit the spike severity). The dropout (purple) case is running on a cos schedule with lr=0.000005, as it has the initial dropout-induced loss to overcome but can run for longer without severe epoch spikes.

Note in particular how close eval_loss is to train_loss on purple (dropout) vs. how far apart they are on orange (no dropout). I may be able to improve learning further with more tuning of schedulers, LR, and weight decay. Note that the original dropout paper used up to 0.5, although obviously that only makes sense if you're going to finetune for a lot of epochs; if you don't want a big initial loss penalty to overcome, you can stick with a much lower dropout than I used (e.g. single digits, even low single digits).

Anyway, as far as I'm concerned, this can now be closed as successfully implemented!
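(For reference, the inv_sqrt schedule mentioned above has roughly this shape - a sketch with an assumed linear-warmup form, not necessarily axolotl's exact implementation:)

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(8, 8)  # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=3e-5)

def inv_sqrt(step, warmup=100):
    # Linear warmup, then decay proportional to 1/sqrt(step).
    if step < warmup:
        return (step + 1) / warmup
    return (warmup / (step + 1)) ** 0.5

sched = LambdaLR(opt, inv_sqrt)
```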
🔖 Feature description
Full finetune suffers badly from epoch spikes, making it difficult to get any further progress from training that lasts past the end of an epoch (and especially 2 or more epochs). A deeper understanding of the data should be achievable with dropout. But while there's lora_dropout, we don't have any dropout option available for full finetune. Any way we could get that added?
✔️ Solution
Add dropout
❓ Alternatives
lora_dropout only applies to LoRAs.
📝 Additional Context
No response