[AIR] Init Mosaic Trainer API #29237
Signed-off-by: ilee300a <[email protected]>
…/ray into init_mosaic_trainer_api
os.environ["WORLD_SIZE"] = str(session.get_world_size())
os.environ["LOCAL_RANK"] = str(session.get_local_rank())

# Arbitrary values set for these as they are needed for some composer functions
Could you add some rationale for why arbitrary values won't affect functionality?
These values are normally used for file formatting and DeepSpeed. I can remove them for now, since they are not used in most cases, and set them in a later iteration when we actually need them.
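For readers following the thread, the pattern under review copies rank information from the Ray session into environment variables that Composer reads. A minimal sketch of that pattern follows; the `export_rank_env` helper and the `fake_session` stand-in are hypothetical names introduced only for illustration, and only the two variables visible in the excerpt are set.

```python
import os
from types import SimpleNamespace
from typing import MutableMapping, Optional


def export_rank_env(session, environ: Optional[MutableMapping[str, str]] = None):
    """Copy rank info from a session-like object into environment variables.

    Hypothetical helper: `session` only needs get_world_size() and
    get_local_rank(), the two calls shown in the excerpt above.
    """
    env = os.environ if environ is None else environ
    # Environment values must be strings, hence the str() conversions.
    env["WORLD_SIZE"] = str(session.get_world_size())
    env["LOCAL_RANK"] = str(session.get_local_rank())
    return env


# Stand-in for the Ray session inside a worker (an assumption for this sketch).
fake_session = SimpleNamespace(
    get_world_size=lambda: 4,
    get_local_rank=lambda: 1,
)

# Write into a plain dict here so the sketch does not touch the real environment.
env = export_rank_env(fake_session, environ={})
```

Passing a plain dict keeps the example free of side effects; omitting the `environ` argument writes to `os.environ` instead.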
Looking great and much easier to review! Mostly nits from my side.
Thanks @ilee300a, this looks great! Left some comments.
What are the remaining items in this PR to make it mergeable?
There is a failing test due to the Mosaic Composer import (https://buildkite.com/ray-project/oss-ci-build-pr/builds/2778#0183f75e-a2b9-4a06-8fc2-8192ac3990af).
For the CI test failure, you can see it's being kicked off from https://sourcegraph.com/github.com/ray-project/ray/-/blob/.buildkite/pipeline.ml.yml?L12-23
With the TRAIN_TESTING flag set to 1, we then install https://sourcegraph.com/github.com/ray-project/ray/-/blob/ci/env/install-dependencies.sh?L362
So you can just add the Mosaic deps to https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/requirements/ml/requirements_train.txt?L16
===== chatted via slack =====
Runtime env is our best bet when the base torch/numpy versions conflict :/
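As a rough illustration of the runtime-env suggestion above: Ray accepts a `runtime_env` dictionary whose `pip` field lists packages to install for a job, which avoids changing the shared base requirements. The sketch below only builds that dictionary; `build_runtime_env` is a hypothetical helper, and the unpinned `mosaicml` entry is an assumption, since the thread does not name a version.

```python
from typing import Dict, Iterable, List


def build_runtime_env(pip_packages: Iterable[str]) -> Dict[str, List[str]]:
    """Return a runtime_env dict that asks Ray to pip-install the given packages.

    Hypothetical helper for illustration; it does not import or call Ray.
    """
    return {"pip": list(pip_packages)}


# Assumption: "mosaicml" is left unpinned because the thread gives no version.
runtime_env = build_runtime_env(["mosaicml"])

# With Ray installed, the dict would be passed along these lines:
#   ray.init(runtime_env=runtime_env)
```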
Signed-off-by: Amog Kamsetty <[email protected]>
Looks like we're good to merge this initial milestone :)?
Co-authored-by: Amog Kamsetty <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: ilee300a [email protected]
In this PR, we provide initial commits for integrating the Mosaic library with Ray. The Mosaic library provides algorithmic acceleration; adding system-side acceleration through Ray's distributed training can further speed up the training process.
Included in this PR is the MosaicTrainer skeleton code. For this PR, the trainer does not support using Ray Dataset shards in the worker loop and assumes that the data loaders are prepared in the trainer init function. The current trainer can run a Composer model with callbacks and loggers, as well as select algorithms. No metrics or checkpoints are reported from the trainer at the moment.
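The control flow described above, where a per-worker init function prepares the data loaders and returns a trainer that the worker loop then fits, can be sketched in plain Python. Everything here is a hypothetical stand-in: `SketchTrainer`, `trainer_init_per_worker`, and `worker_loop` are illustrative names, not the API added in this PR, and no Ray or Composer code is involved.

```python
from typing import Any, Callable, Dict, List


class SketchTrainer:
    """Hypothetical stand-in for a Composer-style trainer (not the real API)."""

    def __init__(self, model: Any, train_dataloader: List[List[int]], max_epochs: int):
        self.model = model
        self.train_dataloader = train_dataloader
        self.max_epochs = max_epochs
        self.batches_seen = 0

    def fit(self) -> None:
        # Iterate over every batch once per epoch; a real trainer would
        # run forward/backward passes here.
        for _ in range(self.max_epochs):
            for _batch in self.train_dataloader:
                self.batches_seen += 1


def trainer_init_per_worker(config: Dict[str, Any]) -> SketchTrainer:
    """Build the data loader inside the init function, as the PR assumes."""
    size, total = config["batch_size"], config["num_samples"]
    batches = [list(range(i, min(i + size, total))) for i in range(0, total, size)]
    return SketchTrainer(config["model"], batches, config["max_epochs"])


def worker_loop(init_fn: Callable[[Dict[str, Any]], SketchTrainer],
                config: Dict[str, Any]) -> SketchTrainer:
    """What each worker does: build the trainer, then fit. Nothing is reported."""
    trainer = init_fn(config)
    trainer.fit()
    return trainer


trainer = worker_loop(
    trainer_init_per_worker,
    {"model": "toy-model", "batch_size": 4, "num_samples": 10, "max_epochs": 2},
)
```

Because the loader is built inside the init function, the worker loop never sees dataset shards, which mirrors the limitation stated in the description.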
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.