
[OA][template] Finetune stable-diffusion (dreambooth) #148

Merged: 14 commits merged into main, Mar 28, 2024

Conversation

@justinvyu (Contributor) commented Mar 20, 2024

Adds a dreambooth stable diffusion finetuning template for OA.

The original dreambooth template re-implemented a huggingface accelerate example.

Additions of the new template:

  • Uses the huggingface example rather than re-implementing it -- this reduces our future maintenance burden when upgrading to SD3. Let's just take the latest version of the diffusers example rather than build a separate version in parallel.
    • This also makes it easier to add other methods that customers are considering like textual inversion, etc. No need to rebuild all those examples.
  • Fine-tunes SDXL instead of SD1.5. ("Using the latest and greatest models.")
  • Shifts the focus of the template away from understanding the training code + learning about Ray Data where it's not really needed, to showing that it is easy to launch a distributed training job from a user's existing script. Fewer concepts to learn when getting started.
  • The data size is very small for stable diffusion fine-tuning, which makes Ray Data not as important to showcase.

Removals:

  • No longer uses Ray Data for "batch inference" to generate images in parallel when constructing the regularization dataset for the prior preservation loss. This is now done serially, following the original script's logic (see the sketch below).
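
For context, a minimal sketch of what the serial regularization-image generation looks like, assuming a diffusers pipeline; the model id, prompt, image count, and output directory are illustrative assumptions, not the template's exact values:

import os
import torch
from diffusers import DiffusionPipeline

CLASS_IMAGES_DIR = "/mnt/cluster_storage/class_images"  # hypothetical output dir
NUM_CLASS_IMAGES = 200                                  # hypothetical count

# Load the base model once, then generate the class images one at a time.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

os.makedirs(CLASS_IMAGES_DIR, exist_ok=True)
for i in range(NUM_CLASS_IMAGES):
    # The dataset is small, so serial generation is acceptable here.
    image = pipe("a photo of a dog", num_inference_steps=25).images[0]
    image.save(os.path.join(CLASS_IMAGES_DIR, f"class_{i:04d}.png"))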

This takes inspiration from the Modal template, which writes very minimal code and leverages the existing huggingface example. Modal just focuses on showing how easy it is to take a huggingface accelerate script and run it on a cluster. We can show the same thing with Ray Train + Anyscale.

Signed-off-by: Justin Yu <[email protected]>
@ericl (Contributor) left a comment


Tried this out, pretty cool. A couple comments:

  1. Since we're trying to de-emphasize NFS, can you move these storage paths to use $ANYSCALE_ARTIFACT_STORAGE instead of NFS?
  2. The notebook has a lot of spam in the cell outputs, can you clear the outputs before saving them? If there are particular outputs to highlight you can include those as screenshots explicitly in the markdown of the cell.

@ericl (Contributor) commented Mar 22, 2024

Also, I think the last cell is repeated twice:

print("\n".join(finetuned_images))
display(*[Image(filename=image_path, width=250) for image_path in finetuned_images])

@ericl (Contributor) commented Mar 22, 2024

huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/cluster_storage/checkpoint-final'. Use `repo_type` argument if needed.

Also, when I ran it, it didn't seem to succeed in saving the checkpoint.

@justinvyu (Contributor, Author)

Since we're trying to de-emphasize NFS, can you move these storage paths to use $ANYSCALE_ARTIFACT_STORAGE instead of NFS?

A lot of the usage is pretty awkward to replace with S3: the HF script just reads from SUBJECT_IMAGES_DIR as a directory, and HF_HOME is very useful to set so that worker startup is quicker, since it can skip the model download time. Is it ok to trade off showing NFS usage for a better user experience?

Is the recommendation to never use NFS for performance reasons? Is it still okay as a place to share some lightweight files between nodes?
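
For reference, a minimal sketch of the NFS usage in question; the exact paths and variable names are illustrative assumptions, not the template's final code:

import os

# Shared cluster storage (NFS): the HF script reads the subject images directly
# from a directory, and a shared HF cache lets workers skip re-downloading the
# base model on startup.
os.environ["HF_HOME"] = "/mnt/cluster_storage/huggingface"  # hypothetical cache dir
SUBJECT_IMAGES_DIR = "/mnt/cluster_storage/subject_images"  # hypothetical images dir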

Also, I think the last cell is repeated twice:

print("\n".join(finetuned_images))
display(*[Image(filename=image_path, width=250) for image_path in finetuned_images])

The last 2 cells are meant to show before vs. after fine-tuning images.

@justinvyu (Contributor, Author)

Also, another bug at the moment: autoscaling with accelerator_type is broken and is waiting on ray-project/ray#44225 to be included in a nightly build.

@ericl (Contributor) commented Mar 22, 2024 via email

Signed-off-by: Justin Yu <[email protected]>
@justinvyu (Contributor, Author)

@ericl Got it, is it ok to assume S3 for now? Will we be adding a GCP hosted cloud any time soon?

@ericl (Contributor) commented Mar 22, 2024

Maybe you can use pyarrow to open the filesystem whether it's GCP or AWS? We are currently onboarding to both platforms.

import pyarrow.fs
import os

fs, path = pyarrow.fs.FileSystem.from_uri(os.environ["ANYSCALE_ARTIFACT_STORAGE"])
print(fs, path)

fs.get_file_info(path + "/my-final-checkpoint")
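
As a follow-up sketch (assuming the final checkpoint was saved to a local directory; the paths below are illustrative), the same filesystem handle can also be used to upload the checkpoint to artifact storage, which works on both AWS and GCP:

import os
import pyarrow.fs

fs, path = pyarrow.fs.FileSystem.from_uri(os.environ["ANYSCALE_ARTIFACT_STORAGE"])

# Copy a locally saved checkpoint directory up to <artifact storage>/my-final-checkpoint.
pyarrow.fs.copy_files(
    "/mnt/cluster_storage/checkpoint-final",  # hypothetical local checkpoint dir
    path + "/my-final-checkpoint",
    destination_filesystem=fs,
)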

@justinvyu requested a review from ericl March 27, 2024 16:57
" train_loop_config={\"args\": TRAINING_ARGS},\n",
" scaling_config=ray.train.ScalingConfig(\n",
" # Do data parallel training with A10G GPU workers\n",
" num_workers=4, use_gpu=True, accelerator_type=\"A10G\"\n",
Contributor

This is a lot of workers for a tiny job, why not 1 or 2 at most?

Contributor Author

The original example uses an effective batch size of 4 by accumulating a batch size of 1 on a single device. Now, I just scale out the cluster to 4 GPUs and skip gradient accumulation to make the training run faster.

Finetuning requires around 100 steps to get decent results, which actually takes quite a bit of time with fewer workers. The current setup takes around 8 minutes.
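
To illustrate the trade-off being discussed, here is a minimal sketch of the scaling setup; `train_fn_per_worker` and `TRAINING_ARGS` are assumed to be defined in the notebook, and the numbers in the comments are just the figures from this thread, not measurements of the final template:

import ray.train
from ray.train.torch import TorchTrainer

# Original HF example: 1 GPU, per-device batch size 1, gradient accumulation over 4 steps
#   -> effective batch size 4, but each optimizer step runs sequentially on one device.
# This template: 4 A10G workers, per-device batch size 1, no gradient accumulation
#   -> same effective batch size 4, with the 4 micro-batches processed in parallel.
trainer = TorchTrainer(
    train_fn_per_worker,
    train_loop_config={"args": TRAINING_ARGS},
    scaling_config=ray.train.ScalingConfig(
        num_workers=4, use_gpu=True, accelerator_type="A10G"
    ),
)
result = trainer.fit()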

@ericl (Contributor) commented Mar 27, 2024

@justinvyu I'm still having issues after commenting out the broken wandb line, any idea?

RayTaskError(TypeError): ray::_Inner.train() (pid=2620, ip=10.0.40.86, actor_id=1d725aa6167d331c9a4e7e3204000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 334, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 53, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(TypeError): ray::_RayTrainWorker__execute.get_next() (pid=2656, ip=10.0.13.128, actor_id=d62f6cf6e383bebea1201b0a04000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f964c517c10>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 169, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/tmp/ipykernel_1246/2651988845.py", line 15, in train_fn_per_worker
TypeError: cannot unpack non-iterable NoneType object

The above exception was the direct cause of the following exception:

TrainingFailedError                       Traceback (most recent call last)
Cell In[11], line 2
      1 # Launch the training.
----> 2 trainer.fit()

File ~/anaconda3/lib/python3.9/site-packages/ray/train/base_trainer.py:638, in BaseTrainer.fit(self)
    634 result = result_grid[0]
    635 if result.error:
...
    641 return result

@justinvyu (Contributor, Author)

@ericl Oh oops, I made some changes without doing another round of testing -- there was a bug in the train function. I'll make the fixes and post a link to a workspace with a working version.

@justinvyu (Contributor, Author)

Here is a workspace with the fixed template: https://console.anyscale-staging.com/v2/cld_kvedZWag2qA8i5BjxUevf5i7/workspaces/expwrk_s5yeb3qu6wheis18ksujmx2ttb/ses_nja945defkghjxpz2yi39956sc?workspace-tab=vscode

@ericl (Contributor) left a comment

Latest changes look good, just a few minor comments + try it out on GCP.

@justinvyu (Contributor, Author)

Template also works on GCP.

@ericl Could you merge?

@ericl merged commit a324503 into main on Mar 28, 2024 (1 check passed)
@ericl (Contributor) commented Mar 28, 2024

Merged. Similar to the other PR, @justinvyu can you follow up with PRs to add it to the product repo for OA home and to the endpoint-docs repo?

EX: https://github.com/anyscale/endpoint-docs/pull/126 and https://github.com/anyscale/product/pull/27319

@justinvyu deleted the finetune_sd_dreambooth branch March 28, 2024 22:09
@justinvyu (Contributor, Author)

Ok done, ptal!

anmscale pushed a commit that referenced this pull request Jun 22, 2024
[OA][template] Finetune stable-diffusion (dreambooth)