Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HPU] [Train] Add a Stable Diffusion fine-tuning and serving example #45217

Merged
merged 10 commits into from
May 30, 2024

Conversation

kira-lin
Copy link
Contributor

@kira-lin kira-lin commented May 9, 2024

Why are these changes needed?

This PR adds an example for stable diffusion model fine-tuning and serving using HPU. Moreover, it also covers how to adapt an existing HPU example to run on Ray, so that users can use Ray to run the examples on huggingface/optimum-habana.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Collaborator

@aslonnie aslonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(for train/tune team to review)

@anyscalesam anyscalesam added triage Needs triage (eg: priority, bug/not-bug, and owning component) train Ray Train Related Issue labels May 9, 2024
@woshiyyya woshiyyya self-assigned this May 9, 2024
@woshiyyya woshiyyya removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label May 9, 2024
" main()\n",
"```\n",
"\n",
"Originally, this script will be started by MPI if multiple workers are used. But with Ray, we should setup TorchTrainer and supply a main function, which is `main()` in this example.\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love these step-by-step explanations! They are crystal clear:)

doc/source/train/examples/intel_gaudi/sd.ipynb Outdated Show resolved Hide resolved
@woshiyyya
Copy link
Member

woshiyyya commented May 10, 2024

Let's also add this example to doc/source/train/examples.yml. You can refer to this PR: https://github.com/ray-project/ray/pull/44667/files.

Also, could you include the required libraries with versions, so that users can reproduce on their own?

Copy link
Member

@woshiyyya woshiyyya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @kira-lin ! Generally looks good to me. Left some comments.

Copy link
Member

@woshiyyya woshiyyya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please adopt the suggested changes to pass the doc build.

doc/source/train/examples.yml Show resolved Hide resolved
@woshiyyya
Copy link
Member

Screenshot 2024-05-16 at 1 08 50 PM

Seems to have a syntax error in the notebook. Can you fix that?

Copy link
Contributor

@peytondmurray peytondmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of things:

  1. You start sd.ipynb with (hpu_sd_finetune)=. I'm not sure what is intended here, but this tells Sphinx to create a new anchor at the top of the page. IMO this is clutter, because sphinx already lets you reference the file itself automatically: instead of referencing hpu_sd_finetune, just make a reference to train/examples/intel_gaudi/sd and sphinx will create a link to the file. Users who click on the link will be taken to the documentation built from this file: docs.ray.io/en/master/train/examples/intel_gaudi/sd.html; your current approach will create a link to docs.ray.io/en/master/train/examples/intel_gaudi/sd.html#hpu_sd_finetune, with the anchor being at the top of the page. Please see the sphinx docs for more information about references.

  2. Any reason this needs to be invoked in the shell (i.e. via !<command>?

!python ~/optimum-habana/examples/stable-diffusion/training/textual_inversion.py \
  --pretrained_model_name_or_path runwayml/stable-diffusion-v1-5 \
  --train_data_dir "/root/cat" \
  --learnable_property object \
  --placeholder_token "<cat-toy>" \
  --initializer_token toy \
  --resolution 512 \
  --train_batch_size 4 \
  --max_train_steps 3000 \
  --learning_rate 5.0e-04 \
  --scale_lr \
  --lr_scheduler constant \
  --lr_warmup_steps 0 \
  --output_dir /tmp/textual_inversion_cat \
  --save_as_full_pipeline \
  --gaudi_config_name Habana/stable-diffusion \
  --throughput_warmup_steps 3

!python ~/... requires the user to be running something bash-like (to have ~ expansion - not sure if this works on windows). You might want to use a different cell magic here, which might solve the lexing issue as well.

  1. The other option that might be able to help here is to set the cell metadata - there may be a way to override the way this is rendered with myst_nb.

  2. Another option would be to put the options on one line, although this does have the impact of ruining the nice spacing that you have here.

  3. The cell output is huge, 3000 lines of terminal progress bars. We should probably not have this appear in our documentation. You can remove cell output by setting cell metadata - see https://myst-nb.readthedocs.io/en/latest/render/format_code_cells.html.

Alternatively, we probably want to make use of nbstripout-fast as part of pre-commit hooks, but maybe this is a separate issue.

@woshiyyya
Copy link
Member

woshiyyya commented May 17, 2024

Hi @peytondmurray , it's me that added this (hpu_sd_finetune)= tag following our old doc style. Sorry for the confusion, we can remove it if it's no longer applicable now.

For (5), @kira-lin let's remove the outputs and only show the result object in a new cell.

@kira-lin kira-lin requested a review from woshiyyya May 22, 2024 01:36
@woshiyyya woshiyyya added the go add ONLY when ready to merge, run all tests label May 22, 2024
Copy link
Member

@woshiyyya woshiyyya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Triggered a premerge run!

Also cc @justinvyu to take a look and merge it?

@kira-lin
Copy link
Contributor Author

ping @woshiyyya @justinvyu , can we merge this?

@anyscalesam anyscalesam added the P1 Issue that should be fixed within a few weeks label May 29, 2024
@woshiyyya
Copy link
Member

Hi @peytondmurray , can you take a look again and see if the change requests can be resolved?

Copy link
Contributor

@peytondmurray peytondmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the one remaining request to remove the extra anchor. Once that's removed, 🚢

doc/source/train/examples/intel_gaudi/sd.ipynb Outdated Show resolved Hide resolved
@anyscalesam anyscalesam enabled auto-merge (squash) May 30, 2024 18:20
@github-actions github-actions bot disabled auto-merge May 30, 2024 18:20
@justinvyu justinvyu merged commit 18eb433 into ray-project:master May 30, 2024
7 checks passed
richardsliu pushed a commit to richardsliu/ray that referenced this pull request Jun 12, 2024
…ay-project#45217)

This PR adds an example for stable diffusion model fine-tuning and
serving using HPU. Moreover, it also covers how to adapt an existing HPU
example to run on Ray, so that users can use Ray to run the examples on
huggingface/optimum-habana.

---------

Signed-off-by: Zhi Lin <[email protected]>
Signed-off-by: Yunxuan Xiao <[email protected]>
Signed-off-by: Samuel Chan <[email protected]>
Co-authored-by: Yunxuan Xiao <[email protected]>
Co-authored-by: Yunxuan Xiao <[email protected]>
Co-authored-by: Samuel Chan <[email protected]>
Co-authored-by: Peyton Murray <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests P1 Issue that should be fixed within a few weeks train Ray Train Related Issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants