Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIR] Init Mosaic Trainer API #29237

Merged
merged 61 commits into from
Oct 25, 2022

Conversation

ilee300a
Copy link
Contributor

@ilee300a ilee300a commented Oct 11, 2022

Signed-off-by: ilee300a [email protected]

In this PR, we provide initial commits for integrating Mosaic library with Ray. As Mosaic library provides algorithmic acceleration, providing further acceleration from the system side via Ray's distributed training, we can improve the speedup of training process.

Included in this PR is MosaicTrainer skeleton code. For this PR, the trainer does not support using ray dataset shards in the worker loop and assumes that the data loaders are prepared in the trainer init function. The current trainer is able to run a composer model with callbacks and loggers as well as select algorithms. No metrics or checkpoints are reported from the trainer at the moment.

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ilee300a ilee300a changed the title Init Mosaic Trainer API [AIR] Init Mosaic Trainer API Oct 12, 2022
python/ray/train/mosaic/mosaic_trainer.py Outdated Show resolved Hide resolved
python/ray/train/mosaic/mosaic_trainer.py Show resolved Hide resolved
os.environ["WORLD_SIZE"] = str(session.get_world_size())
os.environ["LOCAL_RANK"] = str(session.get_local_rank())

# Arbitrary values set for these as they are needed for some composer functions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some rationale why arbitrary values won't affect functionality?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values are normally used for file formatting and Deepspeed. I can remove these for now, since these values are not used for most cases and later set these value in later iteration when we actually need them.

python/ray/train/tests/test_mosaic_trainer.py Show resolved Hide resolved
python/ray/train/tests/test_mosaic_trainer.py Outdated Show resolved Hide resolved
Signed-off-by: ilee300a <[email protected]>
Copy link
Member

@jiaodong jiaodong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking great and much easier to review ! Mostly nits from my side.

python/ray/train/tests/test_mosaic_trainer.py Outdated Show resolved Hide resolved
python/ray/train/tests/test_mosaic_trainer.py Outdated Show resolved Hide resolved
python/ray/train/tests/test_mosaic_trainer.py Outdated Show resolved Hide resolved
python/ray/train/tests/test_mosaic_trainer.py Outdated Show resolved Hide resolved
Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ilee300a this looks great! Left some comments

python/ray/train/mosaic/mosaic_trainer.py Outdated Show resolved Hide resolved
python/ray/train/mosaic/mosaic_trainer.py Outdated Show resolved Hide resolved
python/ray/train/mosaic/mosaic_trainer.py Outdated Show resolved Hide resolved
python/ray/train/mosaic/mosaic_trainer.py Show resolved Hide resolved
python/ray/train/mosaic/mosaic_trainer.py Show resolved Hide resolved
python/ray/train/tests/test_mosaic_trainer.py Show resolved Hide resolved
python/ray/train/tests/test_mosaic_trainer.py Outdated Show resolved Hide resolved
python/ray/train/tests/test_mosaic_trainer.py Outdated Show resolved Hide resolved
python/ray/train/tests/test_mosaic_trainer.py Outdated Show resolved Hide resolved
python/ray/train/tests/test_mosaic_trainer.py Outdated Show resolved Hide resolved
@jiaodong
Copy link
Member

what are remaining items in this PR to make it mergable?

@ilee300a
Copy link
Contributor Author

what are remaining items in this PR to make it mergable?

There is a failing test due to mosaic composer import. (https://buildkite.com/ray-project/oss-ci-build-pr/builds/2778#0183f75e-a2b9-4a06-8fc2-8192ac3990af)

@jiaodong
Copy link
Member

jiaodong commented Oct 20, 2022

For the CI test failure, you can see it's being kicked off from https://sourcegraph.com/github.com/ray-project/ray/-/blob/.buildkite/pipeline.ml.yml?L12-23

With TRAIN_TESTING flag set to 1, we will then install https://sourcegraph.com/github.com/ray-project/ray/-/blob/ci/env/install-dependencies.sh?L362

That you can just add mosaic deps to https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/requirements/ml/requirements_train.txt?L16

===== chatted via slack ===

runtime env is our best bet when base torch/numpy version conflicts :/

.buildkite/pipeline.ml.yml Outdated Show resolved Hide resolved
.buildkite/pipeline.ml.yml Outdated Show resolved Hide resolved
amogkam and others added 3 commits October 24, 2022 09:42
Signed-off-by: Amog Kamsetty <[email protected]>
Signed-off-by: Amog Kamsetty <[email protected]>
Copy link
Member

@jiaodong jiaodong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like we're good to merge this initial milestone :) ?

@amogkam amogkam merged commit 51a7b09 into ray-project:master Oct 25, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Signed-off-by: ilee300a [[email protected]](mailto:[email protected])

In this PR, we provide initial commits for integrating Mosaic library with Ray. As Mosaic library provides algorithmic acceleration, providing further acceleration from the system side via Ray's distributed training, we can improve the speedup of training process.

Included in this PR is MosaicTrainer skeleton code. For this PR, the trainer does not support using ray dataset shards in the worker loop and assumes that the data loaders are prepared in the trainer init function. The current trainer is able to run a composer model with callbacks and loggers as well as select algorithms. No metrics or checkpoints are reported from the trainer at the moment.

Co-authored-by: Amog Kamsetty <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants