
[OA][template] Finetune stable-diffusion (dreambooth) #148

Merged: 14 commits merged into main, Mar 28, 2024

Conversation

@justinvyu (Contributor) commented Mar 20, 2024

Adds a dreambooth stable diffusion finetuning template for OA.

The original dreambooth template re-implemented a huggingface accelerate example.

Additions of the new template:

  • Uses the huggingface example rather than re-implementing it -- this reduces our future maintenance burden when upgrading to SD3. Let's just take the latest version of the diffusers example rather than build a separate version in parallel.
    • This also makes it easier to add other methods that customers are considering like textual inversion, etc. No need to rebuild all those examples.
  • Fine-tunes SDXL instead of SD1.5. ("Using the latest and greatest models.")
  • Shifts the focus of the template away from understanding the training code + learning about Ray Data where it's not really needed, to showing that it is easy to launch a distributed training job from a user's existing script. Fewer concepts to learn when getting started.
  • The data size is very small for stable diffusion fine-tuning, which makes Ray Data not as important to showcase.

Removals:

  • No longer uses Ray Data for "batch inference" to generate images in parallel when constructing the regularization dataset for the prior preservation loss. This is now done serially, following the original script's logic (see the sketch below).
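
For context, a minimal sketch of what the serial regularization-image generation looks like, assuming a diffusers pipeline; the model id, prompt, image count, and output directory are illustrative assumptions, not the template's exact values:

import os
import torch
from diffusers import DiffusionPipeline

CLASS_IMAGES_DIR = "/mnt/cluster_storage/class_images"  # hypothetical output dir
NUM_CLASS_IMAGES = 200                                  # hypothetical count

# Load the base model once, then generate the class images one at a time.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

os.makedirs(CLASS_IMAGES_DIR, exist_ok=True)
for i in range(NUM_CLASS_IMAGES):
    # The dataset is small, so serial generation is acceptable here.
    image = pipe("a photo of a dog", num_inference_steps=25).images[0]
    image.save(os.path.join(CLASS_IMAGES_DIR, f"class_{i:04d}.png"))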

This takes inspiration from the Modal template, which writes very minimal code and leverages the existing huggingface example. Modal just focuses on showing how easy it is to take a huggingface accelerate script and run it on a cluster. We can show the same thing with Ray Train + Anyscale.

Signed-off-by: Justin Yu <[email protected]>
@ericl (Contributor) left a comment


Tried this out, pretty cool. A couple comments:

  1. Since we're trying to de-emphasize NFS, can you move these storage paths to use $ANYSCALE_ARTIFACT_STORAGE instead of NFS?
  2. The notebook has a lot of spam in the cell outputs, can you clear the outputs before saving them? If there are particular outputs to highlight you can include those as screenshots explicitly in the markdown of the cell.

@ericl (Contributor) commented Mar 22, 2024

Also, I think the last cell is repeated twice:

print("\n".join(finetuned_images))
display(*[Image(filename=image_path, width=250) for image_path in finetuned_images])

@ericl (Contributor) commented Mar 22, 2024

huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/cluster_storage/checkpoint-final'. Use `repo_type` argument if needed.

Also, when I ran it, it didn't seem to succeed in saving the checkpoint.

@justinvyu (Contributor, Author)

Since we're trying to de-emphasize NFS, can you move these storage paths to use $ANYSCALE_ARTIFACT_STORAGE instead of NFS?

A lot of the usage is pretty awkward to replace with S3: the HF script just reads from SUBJECT_IMAGES_DIR as a directory, and HF_HOME is very useful to set so that worker startup is quicker, since it can skip the model download time. Is it ok to trade off showing NFS usage for a better user experience?

Is the recommendation to never use NFS for performance reasons? Is it still okay as a place to share some lightweight files between nodes?
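
For reference, a minimal sketch of the NFS usage in question; the exact paths and variable names are illustrative assumptions, not the template's final code:

import os

# Shared cluster storage (NFS): the HF script reads the subject images directly
# from a directory, and a shared HF cache lets workers skip re-downloading the
# base model on startup.
os.environ["HF_HOME"] = "/mnt/cluster_storage/huggingface"  # hypothetical cache dir
SUBJECT_IMAGES_DIR = "/mnt/cluster_storage/subject_images"  # hypothetical images dir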

Also, I think the last cell is repeated twice:

print("\n".join(finetuned_images))
display(*[Image(filename=image_path, width=250) for image_path in finetuned_images])

The last 2 cells are meant to show before vs. after fine-tuning images.

@justinvyu (Contributor, Author)

Also, another bug at the moment: autoscaling with accelerator_type is broken and is waiting on ray-project/ray#44225 to be included in a nightly build.

@ericl (Contributor) commented Mar 22, 2024 via email

Signed-off-by: Justin Yu <[email protected]>
@justinvyu (Contributor, Author)

@ericl Got it, is it ok to assume S3 for now? Will we be adding a GCP hosted cloud any time soon?

@ericl (Contributor) commented Mar 22, 2024

Maybe you can use pyarrow to open the filesystem whether it's GCP or AWS? We are currently onboarding to both platforms.

import pyarrow.fs
import os

fs, path = pyarrow.fs.FileSystem.from_uri(os.environ["ANYSCALE_ARTIFACT_STORAGE"])
print(fs, path)

fs.get_file_info(path + "/my-final-checkpoint")
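
As a follow-up sketch (assuming the final checkpoint was saved to a local directory; the paths below are illustrative), the same filesystem handle can also be used to upload the checkpoint to artifact storage, which works on both AWS and GCP:

import os
import pyarrow.fs

fs, path = pyarrow.fs.FileSystem.from_uri(os.environ["ANYSCALE_ARTIFACT_STORAGE"])

# Copy a locally saved checkpoint directory up to <artifact storage>/my-final-checkpoint.
pyarrow.fs.copy_files(
    "/mnt/cluster_storage/checkpoint-final",  # hypothetical local checkpoint dir
    path + "/my-final-checkpoint",
    destination_filesystem=fs,
)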

@justinvyu requested a review from ericl March 27, 2024 16:57
" train_loop_config={\"args\": TRAINING_ARGS},\n",
" scaling_config=ray.train.ScalingConfig(\n",
" # Do data parallel training with A10G GPU workers\n",
" num_workers=4, use_gpu=True, accelerator_type=\"A10G\"\n",
Contributor

This is a lot of workers for a tiny job, why not 1 or 2 at most?

Contributor Author

The original example uses an effective batch size of 4 by accumulating a batch size of 1 on a single device. Now, I just scale out the cluster to 4 GPUs and skip gradient accumulation to make the training run faster.

Finetuning requires around 100 steps to get decent results, which actually takes quite a bit of time with fewer workers. The current setup takes around 8 minutes.
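
To illustrate the trade-off being discussed, here is a minimal sketch of the scaling setup; `train_fn_per_worker` and `TRAINING_ARGS` are assumed to be defined in the notebook, and the numbers in the comments are just the figures from this thread, not measurements of the final template:

import ray.train
from ray.train.torch import TorchTrainer

# Original HF example: 1 GPU, per-device batch size 1, gradient accumulation over 4 steps
#   -> effective batch size 4, but each optimizer step runs sequentially on one device.
# This template: 4 A10G workers, per-device batch size 1, no gradient accumulation
#   -> same effective batch size 4, with the 4 micro-batches processed in parallel.
trainer = TorchTrainer(
    train_fn_per_worker,
    train_loop_config={"args": TRAINING_ARGS},
    scaling_config=ray.train.ScalingConfig(
        num_workers=4, use_gpu=True, accelerator_type="A10G"
    ),
)
result = trainer.fit()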

@ericl (Contributor) commented Mar 27, 2024

@justinvyu I'm still having issues after commenting out the broken wandb line, any idea?

RayTaskError(TypeError): ray::_Inner.train() (pid=2620, ip=10.0.40.86, actor_id=1d725aa6167d331c9a4e7e3204000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 334, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 53, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(TypeError): ray::_RayTrainWorker__execute.get_next() (pid=2656, ip=10.0.13.128, actor_id=d62f6cf6e383bebea1201b0a04000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f964c517c10>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 169, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/tmp/ipykernel_1246/2651988845.py", line 15, in train_fn_per_worker
TypeError: cannot unpack non-iterable NoneType object

The above exception was the direct cause of the following exception:

TrainingFailedError                       Traceback (most recent call last)
Cell In[11], line 2
      1 # Launch the training.
----> 2 trainer.fit()

File ~/anaconda3/lib/python3.9/site-packages/ray/train/base_trainer.py:638, in BaseTrainer.fit(self)
    634 result = result_grid[0]
    635 if result.error:
...
    641 return result

@justinvyu (Contributor, Author)

@ericl Oh oops, I made some changes without doing another round of testing -- there was a bug in the train function. I'll make the fixes and post a link to a workspace with a working version.

@justinvyu (Contributor, Author)

Here is a workspace with the fixed template: https://console.anyscale-staging.com/v2/cld_kvedZWag2qA8i5BjxUevf5i7/workspaces/expwrk_s5yeb3qu6wheis18ksujmx2ttb/ses_nja945defkghjxpz2yi39956sc?workspace-tab=vscode

@ericl (Contributor) left a comment

Latest changes look good, just a few minor comments + try it out on GCP.

@justinvyu (Contributor, Author)

Template also works on GCP.

@ericl Could you merge?

@ericl merged commit a324503 into main on Mar 28, 2024 (1 check passed)
@ericl (Contributor) commented Mar 28, 2024

Merged. Similar to the other PR, @justinvyu can you follow up with PRs to add it to the product repo for OA home and to the endpoint-docs repo?

EX: https://github.com/anyscale/endpoint-docs/pull/126 and https://github.com/anyscale/product/pull/27319

@justinvyu deleted the finetune_sd_dreambooth branch March 28, 2024 22:09
@justinvyu (Contributor, Author)

Ok done, ptal!

anmscale pushed a commit that referenced this pull request Jun 22, 2024
[OA][template] Finetune stable-diffusion (dreambooth)