Any way we can get dropout on full finetune? #672
Comments
The dropout feature in LoRA comes from PEFT upstream. If we want dropout for full finetune, we would need to modify the model architecture ourselves, since it isn't built in; I'm not sure that's a good idea. Alternatively, perhaps a lower LR might work for you, or experimenting with schedulers?
A lower LR can hide the spike, but it doesn't help spread out learning or prevent the over-concentration of functionality in specific neurons and the resulting bottlenecking. And if you use a lower LR throughout the whole training run (we only have limited tools for nuanced LR control over time), you actually get a worse eval loss, because you start hitting epoch boundaries when you're less far along in the training process. Is there an upstream library where it would be better to add dropout? Another option would be L2 regularization or any of the other dropout alternatives, though directly dropping out parts of the network (whether through traditional dropout, DropPath, or whatnot) is AFAIK the most effective approach. I know most people are using axolotl to train LoRAs, but these epoch spikes with finetuning are a big problem.
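(For concreteness, L2-style regularization is already reachable through the optimizer; a minimal PyTorch sketch with a stand-in model, since AdamW's decoupled weight decay plays a similar regularizing role:)

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in for the actual model being finetuned

# AdamW's decoupled weight decay acts as an L2-style regularizer,
# penalizing large weights rather than dropping activations.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.1)
```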
You would need to modify the model architecture in the modeling code itself. For example, llama: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py However, I don't have enough expertise in updating this to add layers.
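(To illustrate the kind of modeling-code change being described, here's a hypothetical monkey-patch sketch that wraps each decoder layer's MLP with dropout. It follows the attribute names in the HF llama modeling code, but it's an untested illustration, not axolotl code:)

```python
import torch.nn as nn

def add_mlp_dropout(model, p=0.1):
    """Hypothetical sketch: wrap each decoder layer's MLP output in dropout.

    Assumes a HF LlamaForCausalLM-style model exposing model.model.layers,
    each with an .mlp submodule. Illustration only, not a tested patch.
    """
    for layer in model.model.layers:
        drop = nn.Dropout(p)
        orig_forward = layer.mlp.forward
        def patched(x, _f=orig_forward, _d=drop):
            return _d(_f(x))
        layer.mlp.forward = patched
    return model
```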
I will close this for now, as an issue has been opened upstream. Please let us know if this needs to be reopened later due to an update.
@enn-nafnlaus btw, I was doing some experimentation with dropout. I don't know if this iteration works out of the box, but it might be a good starting point: main...llama-dropout
A similar one for mistral too.
Didn't know there was an open branch. Reopening.
Ooh, nice! I'm regenerating a larger dataset at the moment, but will try it as soon as my cards are freed up! :)
Just tried both branches; sadly, no luck.

mistral-dropout: ... It keeps going after that error, but I run out of memory; it seems I can't fit it on either a single RTX 3090 or an RTX 3090 + 3060. So, a dead end.

llama-dropout: [2023-10-11 22:07:21,996] [INFO] [axolotl.load_model:176] [PID:2617024] [RANK:0] patching _expand_mask

This is with model winglian/llama-2-4b. Obviously we don't have support for falcon at all, let alone dropout, so I can't try that.
Any progress on this? :)
Found a small enough mistral model that I can actually try it (hongyin/mistral-0.5b-40k). Unfortunately, it's a terrible model compared to, say, PY007/TinyLlama-1.1B-intermediate-step-480k-1T.

I retested both dropout branches. llama-dropout is still broken in the same manner. I tried the mistral-dropout branch; weirdly, it runs out of VRAM, which does not happen on the main branch. I could try reducing e.g. batch size, but I don't see why its VRAM footprint should differ from mainline...

Going to try an alternative to dropout to deal with the loss spikes at end-of-epoch: I wrote a script to randomly tweak the input data with synonyms, antonyms, hypernyms, and hyponyms, as well as various minor text-formatting changes, to be able to multiply out the input data and thus hopefully reduce the model's ability to memorize the training data when running multiple epochs. Dunno if it'll work, but it's a stopgap...

(Unrelated to normal dropout, but it occurred to me that Learning Rate Dropout would be a cool feature. I don't know if it's been mainlined in PyTorch, but in theory it should allow faster learning with a smaller memory footprint by having only a random fraction of the nodes involved in backpropagation, with the others just running inference and gradient accumulation.)
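(For the curious, the augmentation idea is roughly the following WordNet-based swap - a simplified sketch, not the actual script:)

```python
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet") first

def swap_word(word):
    """Replace a word with a random synonym, hypernym, or hyponym, if any."""
    synsets = wordnet.synsets(word)
    if not synsets:
        return word
    syn = random.choice(synsets)
    # Synonyms from the chosen sense, plus lemmas of hypernyms/hyponyms.
    candidates = [lemma.name() for lemma in syn.lemmas()]
    for related in syn.hypernyms() + syn.hyponyms():
        candidates.extend(lemma.name() for lemma in related.lemmas())
    return random.choice(candidates).replace("_", " ")
```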
Hey, dropout just got added upstream! huggingface/transformers#27315 Hopefully we can use that in axolotl soon!
Cool @enn-nafnlaus! Seems like you can do this manually by adding the config. Alternatively, if you would like to PR it, it's a matter of adding a new param to the yaml and setting it like this rope sample for llama:
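(If doing it manually outside axolotl, a minimal sketch of the config route - the model name is just an example, and `attention_dropout` assumes a transformers build that includes the PR above:)

```python
from transformers import AutoConfig, AutoModelForCausalLM

name = "mistralai/Mistral-7B-v0.1"  # example model only
config = AutoConfig.from_pretrained(name)
config.attention_dropout = 0.05  # key added upstream in transformers #27315
model = AutoModelForCausalLM.from_pretrained(name, config=config)
```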
Will try adding attention_dropout to my yaml this evening. Any way to know whether it's actually being used, apart from a difference being visible in the training outputs?
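(One rough heuristic I can think of, sketched below and not an axolotl feature: in train mode, two forward passes over the same batch should differ if dropout is actually firing, assuming no other stochastic layers:)

```python
import torch

def dropout_seems_active(model, input_ids):
    """Heuristic check: outputs of two train-mode passes differ under dropout."""
    model.train()
    with torch.no_grad():
        a = model(input_ids).logits
        b = model(input_ids).logits
    return not torch.allclose(a, b)
```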
I've noticed the train loss can get pretty high, even with a 0.05 dropout rate.
Trying it out this evening. I don't have a config.json file. My models.py (just did a git pull this evening) doesn't look like the one you link - the closest equivalent to the code you pointed at is:
I tried hacking something to have an equivalent impact - hopefully it works.
Kinda awkward that I can't see if it's being used or not, but I set dropout to 0.2, so hopefully there will be an obvious impact... (I've given up on doing GitHub PRs... the GitHub and approval side is 10 times more effort than the actual code changes.)
Edit: Nope, not seeing a difference between 0.2 and 0.05... I doubt it's being used. Also not seeing my debug output show up.
Hey @enn-nafnlaus, a PR has just been merged to facilitate this. You can just pass the following in your yaml:

```yaml
model_config:
  attention_dropout: 0.01
```

Please let us know how it goes.
I can now verify that it does indeed affect training :) Before I do a serious training run to evaluate its impact on preventing end-of-epoch eval spikes / overfitting to the training data, I need to create a new training baseline, as my base model (TinyLlama) just had a new release. Will update once I have a good answer.
While I still plan to do more test runs with different LRs, LR schedules, and weight decays, I'm prepared to say that this feature is now: A) implemented, and B) effective.

Here's 25% dropout (purple) vs. no dropout (orange), both at 0.1 weight decay. The no-dropout (orange) case uses what I previously determined to be an optimal LR and schedule (inv_sqrt, lr=0.00003 - it has to get as much learning done as possible before the epoch boundary while also having a greatly reduced LR afterwards to limit the spike severity). The dropout (purple) case is running on a cos schedule with lr=0.000005, as it has the initial dropout-induced loss to overcome but can run for longer without severe epoch spikes.

Note in particular how close eval_loss is to train_loss on purple (dropout) vs. how far apart they are on orange (no dropout). I may be able to improve learning further with more tuning of schedulers, LR, and weight decay. Note that the original dropout paper used up to 0.5, although obviously that only makes sense if you're going to finetune for a lot of epochs; if you don't want a big initial loss penalty to overcome, you can stick with a much lower dropout than I used (e.g. single digits, even low single digits).

Anyway, as far as I'm concerned, this can now be closed as successfully implemented!
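(For reference, the inv_sqrt schedule mentioned above has roughly this shape - a sketch with an assumed linear-warmup form, not necessarily axolotl's exact implementation:)

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(8, 8)  # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=3e-5)

def inv_sqrt(step, warmup=100):
    # Linear warmup, then decay proportional to 1/sqrt(step).
    if step < warmup:
        return (step + 1) / warmup
    return (warmup / (step + 1)) ** 0.5

sched = LambdaLR(opt, inv_sqrt)
```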
🔖 Feature description
Full finetune suffers badly from epoch spikes, making it difficult to get any further progress from training that lasts past the end of an epoch (and especially 2 or more epochs). A deeper understanding of the data should be achievable with dropout. But while there's lora_dropout, we don't have any dropout option available for full finetune. Any way we could get that added?
✔️ Solution
Add dropout
❓ Alternatives
lora_dropout only applies to LoRAs.
📝 Additional Context
No response