
Does the IP Adapter support mounting multiple IP Adapter models simultaneously and using multiple reference images at the same time? #6318

Closed
cjt222 opened this issue Dec 25, 2023 · 38 comments
Labels: IPAdapter, stale

@cjt222

cjt222 commented Dec 25, 2023

No description provided.

@cjt222 changed the title from "Translation: Does the IP Adapter support mounting multiple IP Adapter models simultaneously and using multiple reference images at the same time?" to "Does the IP Adapter support mounting multiple IP Adapter models simultaneously and using multiple reference images at the same time?" Dec 25, 2023
@sayakpaul
Member

Are you referring to something like so?
https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/adapter#combining-multiple-adapters

I don't think we support the loading of multiple IP adapters at the moment. Cc: @yiyixuxu

Do you have any convincing results from such a pipeline? Otherwise, it'd be hard for us to prioritize it.

@asomoza
Member

asomoza commented Dec 25, 2023

I would like to comment here since this is something I'm currently looking at. Multiple IP Adapters are useless if they don't support attention masking, but with it they're really good. I don't have examples, but you can watch the video from the developer who made the ComfyUI node:

https://www.youtube.com/watch?v=vqG1VXKteQg&t=470s

This really opens up a lot of options for generation. It's also good to use them when the images are big and you don't want to lose details: you can divide the image and assign IP adapters to portions of the image to make the final generation.

Finally, it's not that important to me since it can be done outside of diffusers, but the multiple-images option is really popular right now and is better known as "instant-lora", where you feed multiple images to one IP-Adapter and they're combined in the attention layer.
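For context, a rough sketch of what an "instant-lora" style call could look like with a diffusers-like API (the list-per-adapter input and the local file names are assumptions here; diffusers did not support this at the time of this comment):

    import torch
    from diffusers import AutoPipelineForText2Image
    from diffusers.utils import load_image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
    pipe.set_ip_adapter_scale(0.6)

    # "instant-lora": several reference images feed the same adapter and are
    # combined in the cross-attention layers
    refs = [load_image(f"ref_{i}.png") for i in range(3)]  # hypothetical local files
    image = pipe("a woman on the beach", ip_adapter_image=[refs]).images[0]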

@sayakpaul
Member

Thanks for the pointers. Would be great to have some code references if you have any.

@whiterose199187

hi @sayakpaul

Here are some relevant discussions I found:

tencent-ailab/IP-Adapter#45
cubiq/ComfyUI_IPAdapter_plus#145 (comment)

@asomoza
Member

asomoza commented Dec 26, 2023

I have a couple of things to do before I can get into this, but here's the InvokeAI implementation of the multiple adapters:

https://github.com/invoke-ai/InvokeAI/pull/4818/files

and here's the code for the ip-adapter node attention masks:

cubiq/ComfyUI_IPAdapter_plus@ebd946f

@asomoza
Member

asomoza commented Dec 26, 2023

I did a test in ComfyUI to get a better understanding:

[images: Image 1 | Image 2 | Mask | Image 3]

Instant lora

Image 1 and image 2 at the same weight: [image]
More weight on image 1: [image]
More weight on image 2: [image]
All three images at the same weight: [image]

Multiple IP Adapters with attention masking

Using the mask for the people only (in ComfyUI you can assign a mask to a color) and the prompt "two women holding each other":

Same weight for each adapter produces a "two faces" effect: [image]

Lowering the weight of each IP adapter produces the desired effect: [image]

And using the third image as a background: [image]

Hope it helps to better understand how they work together.

@whiterose199187

Hello folks,

Is there any plan to support this capability in diffusers?

@patrickvonplaten
Contributor

cc @yiyixuxu

@yiyixuxu
Collaborator

I think code wise it's pretty straightforward to support multiple ip-adapters.

However, I'm trying to understand if it makes sense to support this for every single pipeline? i.e. text2img, img2img, inpaint, controlnet?

@asomoza said it does not work great without the mask - this makes me think maybe we just need one community text2img pipeline that supports multiple ip-adapter, along with the mask. Let me know what you think!

@thibaudart

I think code wise it's pretty straightforward to support multiple ip-adapters.

However, I'm trying to understand if it makes sense to support this for every single pipeline? i.e. text2img, img2img, inpaint, controlnet?

@asomoza said it does not work great without the mask - this makes me think maybe we just need one community text2img pipeline that supports multiple ip-adapter, along with the mask. Let me know what you think!

Yes, often when we work on an image we use:

  • an IP Adapter for style (with multiple reference images)
  • another one for the face
  • some controlnets too

We generate a first image, then work on it using img2img and inpainting.

@asomoza
Member

asomoza commented Jan 10, 2024

Multiple IP Adapters without masks won't do any harm though, but IMO it's more or less the same as one adapter with multiple (weighted) images.

I always thought that diffusers pipelines were just examples, so a community pipeline as an example would be sufficient and people could use it as a reference for their own. But personally I use them with controlnets and/or t2i adapters and masks all the time and almost never use them alone.

Same with the mask: I did an example with just one mask, but it would be better to be able to provide each adapter with its own mask.

The list of features that would make them useful in a community pipeline is probably:

  • Multiple IP Adapters with masks
  • Multiple weighted images with each adapter
  • Negative noise for each image (it really makes a difference)
  • Controlnet and T2I Adapters
  • Start and end in steps or % for each adapter

I don't see that much use for them in img2img or inpainting but I must admit I haven't tested them that much for those tasks.

@thibaudart

No, multiple IP adapters without masks are very helpful:
one for style
one for face (using a different checkpoint)

@asomoza
Member

asomoza commented Jan 10, 2024

No worries, it's just my experience, but in my tests the face adapter also interferes with the style, so for me it's better to use a mask for the face too.

@thibaudart

I generally use a weight of 30% for the face and 70% for the style.

@yiyixuxu
Collaborator

yiyixuxu commented Jan 10, 2024

I think we can:

  1. support MultiIPAdapter (without mask) for all the pipelines
  2. add community pipeline for the masking capability

what do you guys think?

cc @vladmandic for his insights too

@vladmandic
Contributor

@yiyixuxu thanks for looping me in - I think you're spot on.

The big value of IP adapters is their ease of use, and having multiple IP adapters does come up quite often.
(IP adapters are basically fancy embeddings, created using different CLIP models depending on the adapter, with a fixed vector count.)
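(To make the "fancy embeddings" point concrete, a minimal sketch of the CLIP side; the encoder checkpoint is illustrative, since each adapter ships against a specific CLIP vision tower:)

    import torch
    from PIL import Image
    from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

    encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

    inputs = processor(images=Image.open("reference.png"), return_tensors="pt")
    with torch.no_grad():
        image_embeds = encoder(**inputs).image_embeds  # [1, 768] projected embedding
    # an IP-Adapter projects this into a fixed number of extra key/value "tokens"
    # that cross-attention sees alongside the text embeddings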

On the other hand, as soon as you have masking of any kind in the picture, the user's intent becomes a far more manual process, and it's totally fine to have a separate pipeline.

A bit off-topic: having separate pipelines in diffusers for features is somewhat cumbersome, especially since pipeline inheritance is less than ideal (e.g. StableDiffusionImg2ImgPipeline doesn't inherit from StableDiffusionPipeline, so I cannot check the current model type easily), AutoPipeline does not have full coverage, and .from_pipe even less.

IMO, we need a cleaner way to switch pipelines for an already loaded pipeline. Right now I'm instantiating it manually using the loaded pipeline's components, but that causes issues with model offloading and things like that.

Especially with community pipelines: I cannot load from scratch just to run one generation. I want to switch to one when I want to use a specific feature and then switch back.
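For reference, the component-reuse path being discussed looks roughly like this (a sketch; coverage is indeed partial):

    import torch
    from diffusers import AutoPipelineForImage2Image, AutoPipelineForText2Image

    txt2img = AutoPipelineForText2Image.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # reuse the already-loaded components instead of reloading from disk
    img2img = AutoPipelineForImage2Image.from_pipe(txt2img)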

@asomoza
Member

asomoza commented Jan 11, 2024

Yeah, I'm targeting a more professional use case (like Photoshop) rather than a creative, automated, or simple one; that's why I was just giving my opinion. Masking without a UI is not a use case I think people would use a lot unless it's done automatically, which is not the case here. Also, lowering the IP face adapter to 30% just to make it work with other adapters is not ideal for me either.

@yiyixuxu anything that could be added to core diffusers rather than the pipelines works for me. Right now, to use them I had to monkey-patch the attention processor and the unet forward method, which is not ideal; the less I have to do that, the better. Just the addition of multiple IP adapters would help a lot.

@vladmandic what you're describing is the core reason I don't use pipelines: they're too rigid to use in UIs where people need the freedom to add or remove any features they want; you would need to make a pipeline for all the possible combinations, or a huge one with everything in it. But I really like their design, since they're really easy to follow and understand as a starting point.

Just another two cents: I don't think this needs to be added to diffusers, but if you want to make it easier for people to use, you could add an automated negative noise image to the pipeline. Here's an example of the difference:

[images: source image | without noise | with noise]

Thanks for taking our opinions into account.

@yiyixuxu
Collaborator

Cool! I will put out an issue. If no one picks it up quickly, I will work on it.

Also, I just looked into the mask-related code a little bit more: cubiq/ComfyUI_IPAdapter_plus@ebd946f.

I think maybe we can allow IP-Adapter masks to be optionally passed in cross_attention_kwargs and handle them from the attention processor class. My main concern is that we do not want to over-complicate the pipelines; if we can get away with not adding any additional code to the pipelines, we are happy to support the IP adapter mask as well.
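A sketch of what the caller side could look like under that idea (IPAdapterMaskProcessor and the ip_adapter_masks key are the names this eventually landed under; pipeline, the images, and the masks are placeholders):

    from diffusers.image_processor import IPAdapterMaskProcessor

    # one binary mask per IP-Adapter image, resized/padded to the output resolution
    processor = IPAdapterMaskProcessor()
    masks = processor.preprocess([face_mask, background_mask], height=1024, width=1024)

    image = pipeline(
        prompt="two women holding each other",
        ip_adapter_image=[face_image, background_image],  # one image per adapter
        cross_attention_kwargs={"ip_adapter_masks": masks},
    ).images[0]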

@hipsterusername

FWIW, Invoke has supported both multiple IP adapters and multiple images for a while now. We implemented our support before it was in diffusers, so we aren't leveraging the diffusers pipelines, but it may be useful as a reference since we're using diffusers underneath.

@sayakpaul
Member

Links to relevant pieces of code would be much appreciated.

@asomoza
Member

asomoza commented Jan 12, 2024

I posted them before, but these are the PRs from InvokeAI:

Multi-Image IP-Adapter: https://github.com/invoke-ai/InvokeAI/pull/4882/files
Support multiple IP-Adapters (workflow editor only): https://github.com/invoke-ai/InvokeAI/pull/4818/files

I learned from them too, very cool project.

@yiyixuxu
Collaborator

I'm starting to work on this now. We opened a discussion here too #6544.

It would be very nice if any of you can provide an example that I can play with that includes:
1. IP-adapter model checkpoints you used and their respective scale weights
2. input images and other inputs needed, i.e. prompts etc.
3. expected outputs from either ComfyUI or Invoke

@thibaudart

I'm starting to work on this now. We opened a discussion here too #6544.

It would be very nice if any of you can provide an example that I can play with that includes: 1: IP-adapter model checkpoints you used and their respective scale weights 2. input images and other inputs needed, i.e. prompts etc 3. expected outputs from either ComfyUI or invoke

of course:

here's my ComfyUI workflow: https://github.com/fictions-ai/sharing-is-caring/blob/main/workflow_controlnet_ipadapter.json
an archive for style: https://github.com/thibaudart/dreambooth-768/raw/main/style_ziggy.zip

For models, I used SDXL versions:
https://huggingface.co/h94/IP-Adapter/blob/main/sdxl_models/ip-adapter-plus-face_sdxl_vit-h.safetensors
https://huggingface.co/h94/IP-Adapter/blob/main/sdxl_models/ip-adapter-plus_sdxl_vit-h.safetensors

@yiyixuxu
Collaborator

hi @thibaudart
In addition to the style images, would you be able to provide the face image input, the prompt, and maybe an expected output? It would be super helpful.

I started the PR here: #6573. The test example I used did not have very meaningful results.

@thibaudart

input portrait: [image]
prompt: wonderwoman
Face weight: 0.3
Style weight: 0.7
Result: [image]
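For reference, reproducing that setup with the multi-adapter loading from #6573 might look roughly like this (style_images and face_image stand in for the linked style archive and the portrait above):

    import torch
    from diffusers import AutoPipelineForText2Image
    from transformers import CLIPVisionModelWithProjection

    # the ViT-H image encoder that the "vit-h" checkpoints expect
    image_encoder = CLIPVisionModelWithProjection.from_pretrained(
        "h94/IP-Adapter", subfolder="models/image_encoder", torch_dtype=torch.float16
    )
    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        image_encoder=image_encoder,
        torch_dtype=torch.float16,
    ).to("cuda")
    pipe.load_ip_adapter(
        "h94/IP-Adapter",
        subfolder="sdxl_models",
        weight_name=[
            "ip-adapter-plus_sdxl_vit-h.safetensors",       # style
            "ip-adapter-plus-face_sdxl_vit-h.safetensors",  # face
        ],
    )
    pipe.set_ip_adapter_scale([0.7, 0.3])  # style 0.7, face 0.3

    image = pipe(
        "wonderwoman",
        ip_adapter_image=[style_images, face_image],  # one entry per adapter
    ).images[0]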

@yiyixuxu
Collaborator

@thibaudart thanks!

so cool !! 🤩

@thibaudart

@yiyixuxu my pleasure.

@sayakpaul
Member

@yiyixuxu has done a great job of adding support for this in #6573. Look out for the merge :)

@asomoza
Member

asomoza commented Jan 31, 2024

Thank you for your hard work @yiyixuxu.

@yiyixuxu
Collaborator

yiyixuxu commented Feb 1, 2024

@asomoza I'm not too familiar with the "negative noise" feature you pointed out here.
Can you provide some reference?

@asomoza
Member

asomoza commented Feb 1, 2024

No problem @yiyixuxu. IP Adapters allow passing a negative image, which is rarely used; in fact, in the diffusers code the implementation just creates a zero-filled tensor for each image:

    uncond_image_enc_hidden_states = self.image_encoder(
        torch.zeros_like(image), output_hidden_states=True
    ).hidden_states[-2]

What I do instead, thanks to the developer (@cubiq) of the ComfyUI IPAdapterPlus node (I saw this there and nowhere else), is pass a noisy image created from the original image. For this I just use the same code:

    import torch
    import torchvision.transforms as TT

    # image: [B, H, W, C] tensor in [0, 1]; noise: float in [0, 1] set per image
    image = image.permute([0, 3, 1, 2])  # to [B, C, H, W] for torchvision
    torch.manual_seed(0)  # use a fixed seed for reproducible results
    transforms = TT.Compose([
        TT.CenterCrop(min(image.shape[2], image.shape[3])),
        TT.Resize((224, 224), interpolation=TT.InterpolationMode.BICUBIC, antialias=True),
        TT.ElasticTransform(alpha=75.0, sigma=noise * 3.5),  # shuffle the image
        TT.RandomVerticalFlip(p=1.0),  # flip the image to change the geometry even more
        TT.RandomHorizontalFlip(p=1.0),
    ])
    image = transforms(image.cpu())
    image = image.permute([0, 2, 3, 1])  # back to [B, H, W, C]
    image = image + ((0.25 * (1 - noise) + 0.05) * torch.randn_like(image))  # add further random noise

https://github.com/cubiq/ComfyUI_IPAdapter_plus/blob/46241f3ba5401f076f8d90c2aa85f2194910e1a9/IPAdapterPlus.py#L170

where noise is the parameter I control in the UI for each image in each IP Adapter. For example, in the case of just one image:

IP Adapter

[images: source | zero-filled negative | 0.05 noise | 0.2 noise | 1 noise]

IP Adapter PLUS

[images: zero-filled | 0.05 noise | 0.2 noise | 1 noise]

What this does is allow more freedom in the generation, so it can add more details, or you can change the image more with a prompt. For example, the same image with the prompt "white background" and a t2i line art adapter:

[images: zero-filled | 0.2 noise | 1 noise]

This is also good for styles. For example, taking the same "wonder woman" example we were using:

[images: zero-filled style | 1 noise style]

So in the end, it's just another parameter you can use to control the generation. IMO it makes the adapters better, but sometimes I need the details, so the zero-filled tensor also works. The ComfyUI node implements the noise at the adapter level, which means all the images of the same adapter get the same amount of noise (which makes more sense for diffusers); for more control, I implemented it per image.

I really don't know if this should be implemented in diffusers, since I think most people don't want to put too much effort into their generations, and it might become too cumbersome without a user interface.

@cubiq

cubiq commented Feb 1, 2024

You can create custom noise and send it as the negative image. It also works very well with Mandelbrot noise.

[image]

@asomoza
Member

asomoza commented Feb 1, 2024

Thanks @cubiq, I tested it with Mandelbrot noise and indeed it works nicely, especially for the normal IP Adapter. I will add it too, and also test more kinds of noise algorithms.

Just for the fun of it I linked the noise slider with the iterations.

[images: zero | 10 iterations | 50 iterations | 100 iterations]

@cubiq

cubiq commented Feb 1, 2024

Nice! You probably need to lower the CFG or use some kind of CFG rescaling strategy.

@yiyixuxu
Collaborator

@asomoza @cubiq

I think maybe we can just create a nice section in our docs about this!! No? We introduced the ip_adapter_image_embeds argument now, thanks to @sayakpaul. We can just create the image embeddings with the negative noise and pass them to the pipelines as ip_adapter_image_embeds. We don't need to add any code to diffusers this way.

let me know what you think!
cc @stevhliu here too
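To sketch the idea (patterned on the encode_image snippet quoted earlier; pipe and the preprocessed image tensor are placeholders, and the exact batching that ip_adapter_image_embeds expects is glossed over):

    import torch

    # positive branch: encode the reference image as usual
    # ("plus" adapters use the penultimate hidden states)
    image_enc_hidden_states = pipe.image_encoder(
        image, output_hidden_states=True
    ).hidden_states[-2]

    # negative branch: a noised copy of the reference instead of
    # torch.zeros_like(image)
    noise = 0.2
    uncond_image_enc_hidden_states = pipe.image_encoder(
        image + noise * torch.randn_like(image), output_hidden_states=True
    ).hidden_states[-2]

    # stacked negative + positive, these can then be fed to the pipeline via
    # ip_adapter_image_embeds instead of ip_adapter_image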

@asomoza
Member

asomoza commented Feb 19, 2024

I'm having doubts about the negative noise being useful in diffusers. You need to fiddle with it a lot to get the results you want; this is easy with UIs, but re-running the entire pipeline just to see if the output gets any better is not very practical.

I added and tested 6 types of noise, and each of them gives different results, which makes it even harder to test in a pipeline.

Maybe this would be better as a basic example in the new tips-and-tricks section you're thinking about adding. If people are interested or use it, then maybe we can expand on it.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Mar 14, 2024
@sayakpaul
Member

Closing this as it seems we have added support for the feature from many different angles. Feel free to reopen if that's not the case.
