[OA][template] Finetune stable-diffusion (dreambooth) #148
Conversation
Tried this out, pretty cool. A couple comments:
- Since we're trying to de-emphasize NFS, can you move these storage paths to use $ANYSCALE_ARTIFACT_STORAGE instead of NFS? (See the sketch after this list.)
- The notebook has a lot of spam in the cell outputs, can you clear the outputs before saving them? If there are particular outputs to highlight you can include those as screenshots explicitly in the markdown of the cell.
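On the first point, here is a minimal sketch (not the template's actual code) of building the output locations from $ANYSCALE_ARTIFACT_STORAGE instead of NFS paths; the subdirectory layout below is just an assumption for illustration:

```python
import os

# Hypothetical sketch: derive output locations from Anyscale artifact storage
# (an s3:// or gs:// URI exposed via this env var) instead of an NFS mount.
artifact_storage = os.environ["ANYSCALE_ARTIFACT_STORAGE"]
checkpoint_uri = f"{artifact_storage}/dreambooth/checkpoints"            # assumed layout
generated_images_uri = f"{artifact_storage}/dreambooth/generated_images"  # assumed layout
```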
Also, I think the last cell `print("\n".join(finetuned_images))` …
When I ran it, it also didn't seem to save the checkpoint successfully.
A lot of the usage is pretty awkward to replace with S3: the HF script just reads from local paths. Is the recommendation to never use NFS for performance reasons? Is it still okay as a place to share some lightweight files between nodes?
The last 2 cells should show before vs. after fine-tuning images.
Also, another bug at the moment is that autoscaling w/ `accelerator_type` is broken and is waiting on ray-project/ray#44225 to be included in nightly.
Unfortunately, we have a plan to remove NFS in the coming sprints for AIOA clouds, due to the high cost of maintaining it.
@ericl Got it, is it ok to assume S3 for now? Will we be adding a GCP hosted cloud any time soon?
Maybe you can use pyarrow to open the filesystem whether it's GCP or AWS? We are currently onboarding to both platforms.
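A minimal sketch of that pyarrow approach, assuming the artifact storage URI comes from the same environment variable; the local checkpoint path below is hypothetical:

```python
import os
import pyarrow.fs as pafs

# Resolve a cloud URI (s3:// on AWS, gs:// on GCP) to the matching pyarrow
# filesystem. from_uri() returns the filesystem plus the path within it.
artifact_uri = os.environ["ANYSCALE_ARTIFACT_STORAGE"]  # assumed to be set
fs, base_path = pafs.FileSystem.from_uri(artifact_uri)

# Example: upload a locally written checkpoint directory to cloud storage.
pafs.copy_files(
    "/tmp/dreambooth_checkpoint",           # hypothetical local path
    f"{base_path}/dreambooth/checkpoint",
    destination_filesystem=fs,
)
```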
" train_loop_config={\"args\": TRAINING_ARGS},\n", | ||
" scaling_config=ray.train.ScalingConfig(\n", | ||
" # Do data parallel training with A10G GPU workers\n", | ||
" num_workers=4, use_gpu=True, accelerator_type=\"A10G\"\n", |
This is a lot of workers for a tiny job, why not 1 or 2 at most?
The original example uses an effective batch size of 4 by accumulating a batch size of 1 on a single device. Now, I just scale out the cluster to 4 gpus and don't do gradient accumulation to make the training run faster.
Finetuning requires around 100 steps to get decent results, which actually takes quite a bit of time with fewer workers. The current setup takes around 8 minutes.
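To make the equivalence concrete, here is a hedged sketch of the two configurations; the flag names follow the upstream HF train_dreambooth.py script and are an assumption here:

```python
# Effective batch size = num_workers x per-device batch x gradient accumulation.

# Original single-GPU example: 1 worker x batch 1 x accumulation 4 = 4
single_gpu_args = ["--train_batch_size=1", "--gradient_accumulation_steps=4"]

# This template: 4 A10G workers x batch 1 x accumulation 1 = 4
multi_worker_args = ["--train_batch_size=1", "--gradient_accumulation_steps=1"]
```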
@justinvyu I'm still having issues after commenting out the broken wandb line, any idea?
@ericl Oh oops, I made some changes without doing another round of testing -- there was a bug in the train function. I'll make the fixes and post a link to a workspace with a working version.
Here is a workspace with the fixed template: https://console.anyscale-staging.com/v2/cld_kvedZWag2qA8i5BjxUevf5i7/workspaces/expwrk_s5yeb3qu6wheis18ksujmx2ttb/ses_nja945defkghjxpz2yi39956sc?workspace-tab=vscode
Latest changes look good, just a few minor comments + try it out on GCP.
@ericl Could you merge?
Merged. Similar to the other PR, @justinvyu can you follow up with PRs to add it to the product repo for OA home and also the endpoints-doc repo? E.g.: https://github.com/anyscale/endpoint-docs/pull/126 and https://github.com/anyscale/product/pull/27319
Ok done, ptal!
[OA][template] Finetune stable-diffusion (dreambooth)
Adds a dreambooth stable diffusion finetuning template for OA.
The original dreambooth template re-implemented a huggingface accelerate example.
Additions of the new template:
Removals:
This takes inspiration from the Modal template, which writes very minimal code and leverages the existing huggingface example. Modal just focuses on showing how easy it is to take a huggingface accelerate script and run it on a cluster. We can show the same thing with Ray Train + Anyscale.
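For context, here is a minimal sketch of what wrapping the existing huggingface accelerate script with Ray Train can look like; `train_dreambooth` stands for the upstream HF diffusers example, TRAINING_ARGS is a placeholder, and the wiring is illustrative rather than the template's exact code:

```python
import ray.train
from ray.train.torch import TorchTrainer

# Placeholder for the notebook's list of CLI flags passed to the HF script.
TRAINING_ARGS = ["--train_batch_size=1", "--gradient_accumulation_steps=1"]


def train_loop_per_worker(config):
    # Each data-parallel worker runs the unmodified HF example; Ray Train
    # sets up the distributed process group that the accelerate code expects.
    import train_dreambooth  # upstream HF diffusers dreambooth script (assumed importable)

    args = train_dreambooth.parse_args(input_args=config["args"])
    train_dreambooth.main(args)


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"args": TRAINING_ARGS},
    scaling_config=ray.train.ScalingConfig(
        # Do data parallel training with A10G GPU workers
        num_workers=4, use_gpu=True, accelerator_type="A10G"
    ),
)
result = trainer.fit()
```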