
[Contribution] DeepSpeed Integration #4634

Closed

jacobdanovitch opened this issue Sep 12, 2020 · 32 comments

Comments

@jacobdanovitch
Contributor

jacobdanovitch commented Sep 12, 2020

DeepSpeed background

DeepSpeed is a distributed training engine for PyTorch, primarily for training very large language models with significantly less memory. For example, the 17.7 billion parameter Turing-NLG was trained with DeepSpeed's ZeRO optimizer.

Proposal

It seems like a natural fit to expose this in AllenNLP for large, distributed experiments, and it shouldn't require any major changes to integrate. Their training loop looks like:

# https://www.deepspeed.ai/getting-started/#training
model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=model,
                                                     model_parameters=params)
for step, batch in enumerate(data_loader):
    # forward() method
    loss = model_engine(batch)

    # runs backpropagation
    model_engine.backward(loss)

    # weight update
    model_engine.step()

In terms of where it would fit into the library, I think a standalone DeepSpeedTrainer(Trainer) subclass would make sense. It should be fairly similar to GradientDescentTrainer (minus stuff that DeepSpeed handles itself, like gradient accumulation). It could then be initialized from a config file by the user as per usual.
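
Very roughly, I'm imagining something like the sketch below. The class name, constructor arguments, and how it plugs into the existing Trainer API are all up for discussion, and checkpointing, metrics, callbacks, and distributed setup are omitted.

# Sketch only: constructor arguments and Trainer hooks are placeholders.
import deepspeed
from allennlp.training.trainer import Trainer


class DeepSpeedTrainer(Trainer):
    def __init__(self, model, data_loader, deepspeed_args, num_epochs: int = 1, **kwargs) -> None:
        super().__init__(**kwargs)
        self.data_loader = data_loader
        self.num_epochs = num_epochs
        # DeepSpeed wraps the model and constructs its own optimizer/scheduler
        # from its config, so we hold on to the returned engine instead.
        self.model_engine, self.optimizer, _, _ = deepspeed.initialize(
            args=deepspeed_args,
            model=model,
            model_parameters=[p for p in model.parameters() if p.requires_grad],
        )

    def train(self):
        for epoch in range(self.num_epochs):
            for batch in self.data_loader:
                # AllenNLP models return a dict with a "loss" key.
                loss = self.model_engine(**batch)["loss"]
                self.model_engine.backward(loss)  # replaces loss.backward()
                self.model_engine.step()          # replaces optimizer.step()
        return {}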

I know not having dependencies on other libraries is a point of emphasis. It should be possible to include this without adding deepspeed as a hard dependency, letting users install it independently, by doing something like:

# allennlp/training/__init__.py
# ...
try:
    from allennlp.training.deepspeed_trainer import DeepSpeedTrainer
except ImportError:
    pass  # maybe a warning here or something

Initial results

I was able to get a prototype up and running pretty easily. I didn't subclass GradientDescentTrainer (I had a lot of trouble doing that, for whatever reason), so I just copied and pasted the code and started ripping stuff out as I went.

I set up a training experiment for a basic classifier on the first 10k instances of SST using RoBERTa-base across two GPUs. The GradientDescentTrainer completed an epoch in 20.40s, using 8936MB / 10202MB of GPU memory. The DeepSpeedTrainer prototype completed an epoch in 46.91s, using just 4184MB / 4348MB of GPU memory (less than half!). I don't know why it took so much longer, but I strongly suspect it's something I implemented wrong myself.

The repo for this prototype is here.

Potential obstacles

  • I have plenty of time to implement this if it would be a useful addition, but I have little to no idea what I'm doing; I'm not particularly experienced with heavy distributed training.
  • With all due respect, their library could be a bit better documented, and it's seriously challenging to install and get everything compiled just right.
    • That said, a lot of the latter point might be a product of my setup. I'm working on SLURM instead of a personal VM, which makes using their Docker image or installing from source harder than it would otherwise be.

Next steps

I think this could be a useful addition if (1) it's really halving GPU memory for transformer models and (2) it can be implemented non-intrusively. If you guys agree, I can move my prototype code from my repository into an actual PR.

@epwalsh
Member

epwalsh commented Sep 14, 2020

Hi @jacobdanovitch, being able to integrate with DeepSpeed would be awesome.

I'm also worried about adding a new dependency and a new part of the library to maintain though. Another option is to keep this trainer in a separate repository as an official "AllenNLP plugin".

@AkshitaB, @dirkgr, @matt-gardner what are your thoughts?

@matt-gardner
Contributor

If we had a nice place for advertising plugins and what's available, I would probably vote for a plugin for this kind of thing. If we ever get around to implementing a PyTorch Lightning trainer, I would similarly vote for having it as a plugin. I think that's a good option for integrations that depend on large third-party libraries.

@dirkgr
Member

dirkgr commented Sep 21, 2020

If there is a way to avoid making DeepSpeed a real dependency in setup.py, I'd be open to having it in the main library.

@jacobdanovitch, can you run that "Initial results" test you did with amp enabled? I wonder if that explains most of the advantage. If that's the case, we need some other capability or improvement to come from this to make it worthwhile.

DeepSpeed's big thing is distributed training, including automatic model-parallel training, right? We have neither model-parallel nor multi-machine training right now. It would be big if we could add that as a capability. @jacobdanovitch, before you spend a bunch of time wrapping up DeepSpeed nicely, can you try to get your prototype to train models in that setting, and get some numbers? I'm happy to follow up with details if you like!

How stable is DeepSpeed as a dependency? Are they still in a phase where their API changes with every release?

@epwalsh
Member

epwalsh commented Sep 21, 2020

If there is a way to avoid making DeepSpeed a real dependency in setup.py, I'd be open to having it in the main library.

We could put this dependency under an optional feature: https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-extras-optional-features-with-their-own-dependencies
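
Something along these lines (just a sketch; the extra name and version pins are placeholders):

# setup.py (sketch; the "deepspeed" extra name and version pins are placeholders)
from setuptools import setup, find_packages

setup(
    name="allennlp",
    packages=find_packages(),
    install_requires=[
        "torch>=1.6.0",  # ...existing core dependencies
    ],
    extras_require={
        # only pulled in with `pip install allennlp[deepspeed]`
        "deepspeed": ["deepspeed"],
    },
)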

@jacobdanovitch
Contributor Author

jacobdanovitch commented Sep 21, 2020

@dirkgr

@jacobdanovitch, can you run that "Initial results" test you did with amp enabled? I wonder if that explains most of the advantage. If that's the case, we need some other capability or improvement to come from this to make it worthwhile.

Sure. So, DeepSpeed has a few different modes. They have NVIDIA AMP, and they have the "ZeRO memory optimization wrapper for FP16 Training" (which is not compatible with NVIDIA AMP). I ran with both, as well as with DeepSpeed disabled entirely (but still using my trainer).
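
For reference, here's roughly how those modes are selected in the DeepSpeed config, shown as Python dicts mirroring the JSON file it reads (illustrative only, not my exact settings):

# Illustrative only; these mirror DeepSpeed's JSON config keys and are not the
# exact settings from my runs.
nvidia_amp_config = {
    "train_batch_size": 64,
    "amp": {"enabled": True, "opt_level": "O1"},  # NVIDIA/apex AMP
}

zero_fp16_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},          # DeepSpeed's own FP16 handling
    "zero_optimization": {"stage": 1},  # ZeRO wrapper; not compatible with "amp" above
}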

Ran 1 epoch on SST using roberta-base and a batch size of 64.

Hardware:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:3A:00.0 Off |                  Off |
| 30%   34C    P0    62W / 260W |      0MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     Off  | 00000000:88:00.0 Off |                  Off |
| 28%   32C    P0    24W / 260W |      0MiB / 48601MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here are the results:

Trainer             AMP/FP16             GPU0-MB   GPU1-MB
Gradient Descent                         7952      9184
Gradient Descent    ✅ (use_amp=true)    6638      7128
Deepspeed           ✅ (AMP)             5928      5978
Deepspeed                                6894      6966
Deepspeed-ZeRO      ✅ (FP16)            4152      4124

Still not sure if I have everything wired up correctly, but the results seem to be pretty reproducible as far as I can tell. Let me know if there's anything else you'd like me to try. The results seem to make sense at first glance. I'm not sure why the base trainer does so much worse than the DeepSpeed trainer with everything disabled, but maybe even then their optimizer is doing something particularly effective to conserve memory. That would also explain why DeepSpeed+AMP slightly outperforms GradientDescent+use_amp=true even though both are using NVIDIA/legacy AMP.

DeepSpeed's big thing is distributed training, including automatic model-parallel training, right? We have neither model-parallel nor multi-machine training right now. It would be big if we could add that as a capability. @jacobdanovitch, before you spend a bunch of time wrapping up DeepSpeed nicely, can you try to get your prototype to train models in that setting, and get some numbers? I'm happy to follow up with details if you like!

Which setting are you referring to, model-parallel or multi-machine? I thought I'd run some experiments with allennlp on multiple nodes before, but maybe I'm misremembering. If that doesn't exist yet, maybe deepspeed could be one of a few different backends for it (like how Lightning has horovod, ddp, ddp2, etc.).

How stable is DeepSpeed as a dependency? Are they still in a phase where their API changes with every release?

Their API seems stable; it doesn't appear to have changed since release. They're shipping some pretty large new features periodically, but it all seems to be handled through their config files (which I made a FromParams object for).
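
The FromParams wrapper is roughly this shape (simplified; only a handful of DeepSpeed's config keys are shown, and the class name is just what I'm using for now):

# Simplified sketch of the config wrapper; only a few DeepSpeed keys are shown.
from typing import Any, Dict, Optional

from allennlp.common import FromParams


class DeepSpeedConfig(FromParams):
    def __init__(
        self,
        train_batch_size: int,
        gradient_accumulation_steps: int = 1,
        fp16: Optional[Dict[str, Any]] = None,
        zero_optimization: Optional[Dict[str, Any]] = None,
        amp: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.train_batch_size = train_batch_size
        self.gradient_accumulation_steps = gradient_accumulation_steps
        self.fp16 = fp16 or {"enabled": False}
        self.zero_optimization = zero_optimization or {}
        self.amp = amp or {"enabled": False}

    def to_dict(self) -> Dict[str, Any]:
        # The dict DeepSpeed expects, e.g. serialized out to the JSON file it reads.
        return {
            "train_batch_size": self.train_batch_size,
            "gradient_accumulation_steps": self.gradient_accumulation_steps,
            "fp16": self.fp16,
            "zero_optimization": self.zero_optimization,
            "amp": self.amp,
        }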

@jacobdanovitch
Contributor Author

We could put this dependency under an optional feature: https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-extras-optional-features-with-their-own-dependencies

I was going to suggest something like this as well. Is this what some libraries use for installations like pip install library[feature1,feature2]?

I'm not sure how it works, but would this let users install the base library with pip install allennlp, excluding files like a potential deepspeed trainer by default? And then pip install allennlp[deepspeed,...] to select which plugins they'd like? This could be a good way to handle plugins without forcing them on users.

@epwalsh
Member

epwalsh commented Sep 21, 2020

@jacobdanovitch yeup, exactly!

@dirkgr
Member

dirkgr commented Sep 24, 2020

@jacobdanovitch, those are great results. What are the timings for those, i.e., how quickly did they get done? What model did you use to do this?

Do you have NVLink between the two cards?

@jacobdanovitch
Contributor Author

@jacobdanovitch, those are great results. What are the timings for those, i.e., how quickly did they get done?

Training is slow right now. The FP16 trainer took about 9s on average to get through 3520 instances at a batch size of 64. The Deepspeed trainer takes a lot longer, about 20-23s, but there's a lot of stuff going on in their model engine (logging, timing, tensorboard writing, etc.). I'm sort of stuffing a trainer inside another trainer right now, so this should get faster once I wrap it properly. Also, a lot of their speed-related gains come from their sparse attention kernels, which I'm not focusing on yet.

What model did you use to do this?

Basic RoBERTa config for SST.

local transformer_model = "roberta-base";

{
    "type": "basic_classifier",
    "text_field_embedder": {
        "token_embedders": {
            "tokens": {
                "type": "pretrained_transformer",
                "model_name": transformer_model
            }
        }
    },
    "seq2vec_encoder": {
        "type": "bert_pooler",
        "pretrained_model": transformer_model,
        "dropout": 0.1
    },
}

Do you have NVLink between the two cards?

No. Our cluster has some NVLink-compatible cards but I was having trouble getting everything compiled properly for them. I can try again if you'd like.

@dirkgr
Member

dirkgr commented Sep 24, 2020

That all sounds promising, though some risks remain. If it were me, I'd want to make sure that the performance improvements actually materialize. At the end of the day, we want some new capability: either training faster, or at least training bigger models in the same amount of memory. But there is also a point where the investigation starts to take longer than just doing the thing. I'll trust your judgement on that.

@jacobdanovitch
Contributor Author

jacobdanovitch commented Sep 24, 2020

What would be the best way to move forward on this? Should I start working it into a PR? Not sure what the consensus is on whether it belongs in the main library or a plugin library.

@dirkgr
Member

dirkgr commented Sep 24, 2020

I'll say let's make it a separate trainer in the main library. If it turns out it's too weird with dependencies or something, then we'll move it into a plugin later.

@bratao
Contributor

bratao commented Sep 30, 2020

@jacobdanovitch do you have any update on this? Sorry for pinging you, but I'm currently evaluating DeepSpeed + AllenNLP because of their Sparse Transformer implementation, and it would be nice to have a head start on this.

@jacobdanovitch
Contributor Author

@jacobdanovitch do you have any update on this? Sorry for pinging you, but I'm currently evaluating DeepSpeed + AllenNLP because of their Sparse Transformer implementation, and it would be nice to have a head start on this.

Hey, no worries, I was gonna check in tonight anyway. I haven't had a second to open a PR so far this week, but the repo I linked in my OP is there if you're looking to try it out; it's just two Python files and a config (I forget if I've pushed my latest changes, though). Feel free to email me too, I can help you get started.

@jacobdanovitch
Contributor Author

First draft of this is linked above. I tried my best to get multi-node working as well, but I can't even get that to work with the regular trainer (this is SLURM's fault, not allennlp's). It might work out of the box if you set up your config file properly with num_nodes, master_address, etc. (as you would for the regular trainer), but for the moment I don't have a non-infuriating way of testing this.

Single node/multi GPU is entirely functional, and all the benchmarks match what I reported (I mostly just copied and pasted the code from my repo into training/deepspeed_trainer.py). I'm more concerned with whether I've hooked everything together correctly. Right now I'm basically just replacing the trainer's model object with the deepspeed model_engine, which is one way to go about it. The issue, as I said above, is that it's slow because the engine does a bunch of superfluous logging/monitoring that we wouldn't need. To avoid that, I'll look into side-stepping their engine.
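
Concretely, swapping in the engine amounts to something like this in the batch loop (simplified; batch_group here stands for the same accumulation grouping the existing trainer already iterates over, and mixed precision, clipping, and metric tracking are omitted):

# Simplified: what replacing the model with DeepSpeed's engine boils down to
# inside the existing batch loop.
for batch in batch_group:
    output_dict = model_engine(**batch)  # forward() of the wrapped AllenNLP model
    loss = output_dict["loss"]
    model_engine.backward(loss)          # instead of loss.backward()
    model_engine.step()                  # instead of optimizer.step(); also handles
                                         # gradient accumulation and LR scheduling internally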

Of note: It's quite fast for a little while after starting training, and then at some point slows to a crawl. Not sure if that's because of the monitoring or something else.

@dirkgr
Member

dirkgr commented Oct 6, 2020

What's next for this feature? Did you manage to address the slowdown?

@jacobdanovitch
Contributor Author

Did you manage to address the slowdown?

@dirkgr I think I might have finally identified the source of it. The slowdown (1) happens at about the same point every run, (2) isn't so much a "slowdown" per se as it is hanging, and (3) seems to correlate directly with the number of gradient accumulation steps. My very uneducated guess would be that communication (in general) is the bottleneck here and that accumulation alleviates it. A few possibilities for why that may be:

  • I'm not using NVLink
  • It's something to do with my cluster and I'm the only one experiencing this
  • They launch with subprocesses whereas allennlp launches with mp.spawn

When using as few as 4 steps of accumulation, the slowdown ranges from tolerable to almost negligible, and is (imo) a perfectly fair compromise for the large savings in GPU usage (which remain similar to what I reported above).
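
For context, accumulation itself is just a key in the DeepSpeed config rather than something the trainer does, e.g. (illustrative values only):

# Illustrative values: DeepSpeed expects
# train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * num_gpus.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "train_batch_size": 64,  # 8 * 4 * 2 GPUs
}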

What's next for this feature?

  1. It would be really helpful if someone else could try this on their machine to see if the poor communication speed is just a product of my cluster. If no one else can, I can set up a couple of VMs, but maybe @bratao could give it a shot since he's already been experimenting?
  2. Where in the library should this go? I'm going to break things out into a bunch of registrables and such to make it a lot cleaner, and I don't want to pollute allennlp.training. Would it be best in allennlp.training.deepspeed, or should something like allennlp.contrib be added? @matt-gardner @epwalsh This is related to the discussion on how to avoid making it a dependency.

@dirkgr
Member

dirkgr commented Oct 8, 2020

Don't worry about not making it a dependency yet. Let's get it to work first and then we'll see. Same goes for the namespace. Just put it into allennlp.training.deepspeed right now. If we have to move it later, we will do that. Those two things are not the hard problems to solve. Moving and renaming is easy :-)

@github-actions

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇

@jacobdanovitch
Contributor Author

Whoops, still working on this, sorry. I've been busy with WWW coming up. I've taken the relevant bits of training code down from about 1500 LOC to 500, but there's still a lot to unpack to make everything registrable properly.

@jacobdanovitch
Contributor Author

Sorry this has taken so long, but I have some actual initial results and should have a PR ready next week. I set up a wandb logger and created a report (link) comparing the existing gradient descent trainer with/without FP16 to DeepSpeed, for those interested.

Overall, the gradient descent trainer used 7.4GB, versus 4.3GB for DeepSpeed stage 2, which also seemed to draw less power (watts/heat wise). DeepSpeed also seemed to train better, for some reason, though I wouldn't put much stock in that.

@dirkgr
Member

dirkgr commented Nov 2, 2020

I was on vacation, but I'm back now. I'll take a look!

@dirkgr dirkgr reopened this Nov 2, 2020
@dirkgr
Member

dirkgr commented Nov 2, 2020

In that link, it looks like only the DeepSpeed stages learned anything, and the others stayed random. Am I reading that wrong?

@jacobdanovitch
Contributor Author

In that link, it looks like only the DeepSpeed stages learned anything, and the others stayed random. Am I reading that wrong?

Yeah, after looking into it, I just needed to turn down the learning rate for the others and they reach 93+ accuracy as well. I just had those plots in there to show that deepspeed does in fact converge (since FP16 training can sometimes be unstable), sorry for the confusion.

So in general, performance is even across all of them as suspected, and memory consumption is significantly lower with deepspeed.

@dirkgr
Member

dirkgr commented Nov 4, 2020

The runtimes are really short. This ran for only 2.5 minutes?

@jacobdanovitch
Contributor Author

The runtimes are really short. This ran for only 2.5 minutes?

Yes. This is on SST which is only about 10k instances iirc. Batch size 64; I think the config should be logged for each experiment there. I can run on a larger dataset, with bigger batch sizes, more epochs, varying levels of gradient accumulation, etc. if you'd like.

@dirkgr
Member

dirkgr commented Nov 4, 2020

I don't remember how long SST takes to load, but I'd worry that it spends most of its time loading data, so we don't get good numbers with such a short runtime. On the machines I typically run on, it easily spends 10s just on imports. That's 7% of the whole runtime.

@jacobdanovitch
Contributor Author

I cranked it up a bit with SNLI. Charts here. TLDR:

  • Still RoBERTa base
  • First 50k instances only (can do more, but it will obviously take time)
  • Batch size 512, 5 epochs
  • Only compared FP16 vs deepspeed this time
  • Both converge
  • Both are similarly fast in terms of s/it
  • Peak GPU usage favors deepspeed: 23GB vs. 36GB for the FP16 trainer

For reference, Wandb starts logging when my EpochCallback gets called with epoch == -1, i.e. after everything before that point has been loaded.
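
The hook is just an EpochCallback along these lines (simplified, and the callback name and wandb project are made up for the example):

# Simplified sketch of the logging hook; the registered name and project are placeholders.
from typing import Any, Dict

import wandb
from allennlp.training.trainer import EpochCallback, GradientDescentTrainer


@EpochCallback.register("wandb_epoch_logger")
class WandbEpochLogger(EpochCallback):
    def __call__(
        self,
        trainer: GradientDescentTrainer,
        metrics: Dict[str, Any],
        epoch: int,
        is_master: bool,
    ) -> None:
        if epoch == -1 and is_master:
            # Called once before the first epoch, i.e. after data loading/setup.
            wandb.init(project="allennlp-deepspeed")
        elif is_master:
            wandb.log(metrics, step=epoch)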

Happy to run any other configurations as well. Also gonna move my PR from draft to ready for review.

@github-actions

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇

@dirkgr
Member

dirkgr commented Nov 25, 2020

This is not closed. @jacobdanovitch is working on it in #4693!

@dirkgr dirkgr reopened this Nov 25, 2020
@github-actions

github-actions bot commented Dec 4, 2020

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇

@github-actions github-actions bot closed this as completed Dec 4, 2020
@bratao
Contributor

bratao commented Dec 4, 2020

This is not closed. @jacobdanovitch is working on it in #4693!
