[Contribution] DeepSpeed Integration #4634
Hi @jacobdanovitch, being able to integrate with DeepSpeed would be awesome. I'm also worried about adding a new dependency and a new part of the library to maintain though. Another option is to keep this trainer in a separate repository as an official "AllenNLP plugin". @AkshitaB, @dirkgr, @matt-gardner what are your thoughts?
If we had a nice place for advertising plugins and what's available, I would probably vote for a plugin for this kind of thing. If we ever get around to implementing a pytorch lightning trainer, I would similarly vote for having it as a plugin. I think that's a good option for integrations that depend on large third-party libraries.
If there is a way to avoid making DeepSpeed a real dependency, that would be ideal.

@jacobdanovitch, can you run that "Initial results" test you did with amp enabled? I wonder if that explains most of the advantage. If that's the case, we need some other capability or improvement to come from this to make it worthwhile.

DeepSpeed's big thing is distributed training, including automatic model-parallel training, right? We have neither model-parallel nor multi-machine training right now. It would be big if we could add that as a capability. @jacobdanovitch, before you spend a bunch of time wrapping up DeepSpeed nicely, can you try to get your prototype to train models in that setting, and get some numbers? I'm happy to follow up with details if you like!

How stable is DeepSpeed as a dependency? Are they still in a phase where their API changes with every release?
We could put this dependency under an optional feature: https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-extras-optional-features-with-their-own-dependencies
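For concreteness, a minimal sketch of what that could look like in `setup.py` (the extra's name, the version pin, and the surrounding fields here are assumptions for illustration, not settled decisions):

```python
from setuptools import find_packages, setup

setup(
    name="allennlp",
    packages=find_packages(),
    install_requires=[
        # ... the existing required dependencies stay here ...
    ],
    extras_require={
        # `pip install allennlp` skips this; `pip install "allennlp[deepspeed]"`
        # pulls it in. The version pin is illustrative only.
        "deepspeed": ["deepspeed>=0.3.0"],
    },
)
```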
Sure. So, Deepspeed has a few different modes. They have NVIDIA AMP, and they have the "ZeRO memory optimization wrapper for FP16 Training" (which is not compatible with NVIDIA AMP). I ran on both, as well as with deepspeed disabled entirely (but still using my trainer), for 1 epoch on SST. Hardware:
Here are the results:
Still not sure if I have everything wired up correctly, but the results seem to be pretty reproducible as far as I can tell. Let me know if there's anything else you'd like me to try.

The results seem to make sense at first glance. I'm not sure why the base trainer does much worse than the Deepspeed trainer with everything disabled, but maybe even with everything disabled, their optimizer is doing something particularly effective to conserve memory. That would also explain why Deepspeed+AMP slightly outperforms GD+AMP.
Which setting are you referring to, model parallel/multi-machine? I thought I'd run some experiments with allennlp on multiple nodes before, but maybe I'm misremembering. If that doesn't exist yet, maybe deepspeed could be one of a few different backends for that (like how Lightning has multiple distributed backends).
Their API seems stable; it doesn't appear to have changed since release. They're releasing some pretty large new features periodically, but it seems like it's all handled through their config files (which I made a wrapper for).
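For reference, a DeepSpeed config is just a JSON file; here's a rough sketch of the kind of settings involved, written as a Python dict for readability (the keys follow DeepSpeed's documented schema, but the concrete values are illustrative, not the ones used in the runs above):

```python
import json

ds_config = {
    "train_batch_size": 64,              # total batch size across all GPUs
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},           # ZeRO-compatible FP16 training
    "zero_optimization": {"stage": 2},   # partition optimizer state + gradients
    "optimizer": {"type": "Adam", "params": {"lr": 2e-5}},
}

# DeepSpeed normally reads this from a JSON file passed on the command line.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```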
I was going to suggest something like that as well. Is this what some libraries use for installations like `pip install allennlp[deepspeed]`? I'm not sure how it works, but would this let users install the base library with just `pip install allennlp`?
@jacobdanovitch yeup, exactly!
@jacobdanovitch, those are great results. What are the timings for those, i.e., how quickly did they get done? What model did you use to do this? Do you have NVLink between the two cards?
Training is slow right now. The FP16 trainer took about 9s on average to get through 3520 instances at a batch size of 64. The Deepspeed trainer takes a lot longer, about 20-23s, but there's a lot of stuff going on in their model engine (logging, timing, tensorboard writing, etc.). I'm sort of stuffing a trainer inside another trainer right now, so this should get faster when I wrap it nicely. Also, a lot of their speed-related gains come from their sparse attention kernels, which I'm not focusing on yet.
Basic RoBERTa config for SST:

```jsonnet
local transformer_model = "roberta-base";

{
  "type": "basic_classifier",
  "text_field_embedder": {
    "token_embedders": {
      "tokens": {
        "type": "pretrained_transformer",
        "model_name": transformer_model
      }
    }
  },
  "seq2vec_encoder": {
    "type": "bert_pooler",
    "pretrained_model": transformer_model,
    "dropout": 0.1
  }
}
```
No. Our cluster has some NVLink-compatible cards but I was having trouble getting everything compiled properly for them. I can try again if you'd like.
That all sounds promising, though some risks remain. If it was me, I'd want to make sure that the performance improvements actually materialize. At the end of the day, we want to have some new capability, i.e., either training faster, or at least training bigger models in the same amount of memory. But there is also a time when the investigation starts to take longer than just doing the thing. I'll trust your judgement on that.
What would be the best way to move forward on this? Should I start working it into a PR? Not sure what the consensus is on whether it belongs in the main library or a plugin library.
I'll say let's make it a separate trainer in the main library. If it turns out it's too weird with dependencies or something, then we'll move it into a plugin later.
@jacobdanovitch did you have any update on this? Sorry for pinging you, but I'm currently evaluating Deepspeed + AllenNLP thanks to their sparse transformer implementation, and it would be nice to have a head start on this.
Hey, no worries, I was gonna check in tonight anyway. I haven't had a second to open a PR so far this week, but there's the repo that I linked to in my OP if you're looking to try it out; it's just two python files and a config (I forget if I pushed a few changes though). Feel free to email me too, I can help you get started.
First draft of this is linked above. I tried my best to get multi-node working as well, but I can't even get it to work with the regular trainer (this is SLURM's fault, not allennlp). It might work out of the box if you set up your config file properly.

Single node/multi GPU is entirely functional and all the benchmarks match what I reported (I mostly just copied and pasted the code from my repo).

Of note: it's quite fast for a little while after starting training, and then at some point slows to a crawl. Not sure if that's because of the monitoring or something else.
What's next for this feature? Did you manage to address the slowdown?
@dirkgr I think I might have finally identified the source of it. The slowdown (1) happens at about the same point every run, (2) isn't as much of a "slowdown" per se as it is hanging, and (3) seems to directly correlate to the number of gradient accumulation steps. My very uneducated guess would be that communication (in general) is the bottleneck here, and that accumulation alleviates this. A few possibilities on why that may be:
When using as few as 4 steps of accumulation, the slowdown ranges from tolerable to almost negligible and is (imo) a perfectly fair compromise for the large savings in GPU usage (which remain quite similar).
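Concretely, that mitigation amounts to a one-line change in the DeepSpeed config (a sketch; the field name comes from DeepSpeed's schema, and the value 4 matches the setting described above):

```python
ds_config = {
    "train_batch_size": 64,
    # Apply/synchronize gradients every 4 micro-batches instead of every batch,
    # which is the setting that made the hang tolerable in the runs above.
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}
```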
Don't worry about not making it a dependency yet. Let's get it to work first and then we'll see. Same goes for the namespace. Just put it into
This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇
Whoops, still working on this, sorry. Have been busy with WWW coming up. I've taken the relevant bits of training code down from about 1500 LOC to 500, but there's still a ways to go. Lots to unpack to make stuff registrable properly.
Sorry this has taken so long, but I have some actual initial results and should have a PR ready next week. I set up a wandb logger and created a report (link) comparing the existing gradient descent trainer with/without FP16 to deepspeed, for those interested. Overall, the gradient descent trainer used 7.4GB, versus 4.3GB for deepspeed stage 2, which also seemed to use less power (watts/heat-wise). Deepspeed also seemed to train better, for some reason, though I wouldn't put much stock in that.
I was on vacation, but I'm back now. I'll take a look!
In that link, it looks like only the DeepSpeed stages learned anything, and the others stayed random. Am I reading that wrong?
Yeah after looking into it I just needed to turn down the learning rate for the others and they get their accuracy to 93+ as well. I just had those plots in there to show that deepspeed does in fact converge (as FP16 training can be unstable sometimes), sorry for the confusion. So in general, performance is even across all of them as suspected, and memory consumption is significantly lower with deepspeed.
The runtimes are really short. This ran for only 2.5 minutes?
Yes. This is on SST which is only about 10k instances iirc. Batch size 64; I think the config should be logged for each experiment there. I can run on a larger dataset, with bigger batch sizes, more epochs, varying levels of gradient accumulation, etc. if you'd like.
I don't remember how long SST takes to load, but I'd worry that it spends most of its time loading data, and so we don't get good numbers with such a short runtime. On the machines I typically run on, it easily spends 10s just doing
I cranked it up a bit with SNLI. Charts here. TLDR:
For reference, Wandb starts logging when my run starts up. Happy to run any other configurations as well. Also gonna move my PR from draft to ready for review.
This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇
This is not closed. @jacobdanovitch is working on it in #4693!
DeepSpeed background
DeepSpeed is a distributed training engine for PyTorch, primarily for training very large language models with significantly less memory. For example, the 17.7 billion parameter Turing-NLG was trained with DeepSpeed's ZeRO optimizer.
Proposal
It seems like a natural fit to have a way to use this with AllenNLP for large, distributed experiments. It also shouldn't require any major changes to integrate. Their training loop looks like:
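(The exact snippet isn't reproduced above. Based on DeepSpeed's documented API, the loop is roughly the following sketch; the toy model, data, and launcher comment are placeholders, not anything from the original post.)

```python
import argparse

import torch
import deepspeed

# A toy model and data stand in for a real AllenNLP model and data loader here.
model = torch.nn.Linear(10, 2)
batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(4)]

# The `deepspeed` launcher supplies distributed args; --deepspeed_config points
# at a JSON config (batch size, FP16/ZeRO settings, optimizer), e.g.:
#   deepspeed train.py --deepspeed_config ds_config.json
parser = deepspeed.add_config_arguments(argparse.ArgumentParser())
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

# DeepSpeed wraps the model in an "engine" that owns the optimizer,
# FP16/ZeRO logic, and distributed communication.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters()
)

for inputs, labels in batches:
    inputs = inputs.to(model_engine.local_rank)
    labels = labels.to(model_engine.local_rank)
    loss = torch.nn.functional.cross_entropy(model_engine(inputs), labels)
    model_engine.backward(loss)   # replaces loss.backward()
    model_engine.step()           # replaces optimizer.step() + zero_grad()
```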
In terms of where it would fit into the library, I think a standalone `DeepSpeedTrainer(Trainer)` subclass would make sense. It should be fairly similar to `GradientDescentTrainer` (minus stuff that DeepSpeed handles itself, like gradient accumulation). It could then be initialized from a config file by the user as per usual.

I know not having dependencies on other libraries is a point of emphasis. It should be possible to include this without adding deepspeed as a dependency, allowing the user to install it independently by doing something like:
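(The original snippet isn't shown above. One common pattern is to guard the import and only fail when the trainer is actually used; the sketch below is illustrative, with a hypothetical class body and error message, not the actual PR code.)

```python
try:
    import deepspeed
except ImportError:  # keep deepspeed optional; only error out on actual use
    deepspeed = None


class DeepSpeedTrainer:  # in practice this would subclass allennlp's Trainer
    def __init__(self, *args, **kwargs):
        if deepspeed is None:
            raise ImportError(
                "DeepSpeedTrainer requires the `deepspeed` package. "
                "Install it with `pip install deepspeed` (or, if an extra is "
                "added, `pip install allennlp[deepspeed]`)."
            )
        # ... wrap the model with deepspeed.initialize(...) here ...
```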
Initial results
I was able to get a prototype up and running pretty easily. I didn't subclass `GradientDescentTrainer` (I had a lot of trouble doing that, for whatever reason), but I just copied and pasted the code and started ripping stuff out as I went.

I set up a training experiment for a basic classifier on the first 10k instances of SST using RoBERTa-base across two GPUs. The `GradientDescentTrainer` completed an epoch in 20.40s, using 8936MB / 10202MB of GPU memory. The `DeepSpeedTrainer` prototype completed an epoch in 46.91s, using just 4184MB / 4348MB of GPU memory (less than half!). I don't know why it took so much longer, but I strongly assume it's something I implemented wrong myself.

The repo for this prototype is here.
Potential obstacles
Next steps
I think this could be a useful addition if (1) it's really halving GPU memory for transformer models and (2) it can be implemented non-intrusively. If you guys agree, I can move my prototype code from my repository into an actual PR.