agent How to implement a rich content format with both text and images? #167

wolkerzheng · 2024-09-24T13:33:10Z

I have more than 20 different pictures. How can I have the agent pick the appropriate picture and include it as part of its answer?

moyanxinxu · 2024-09-25T08:28:27Z

For now, it looks like this isn't possible.

AniviaTn · 2024-09-25T09:36:22Z

We currently don't have a ready-made solution, but you can achieve your goal with either of the following two approaches:

You can merge all the images into one large image and submit it to a multimodal model such as gpt-4o, using a prompt that asks it to provide the coordinates of the selected image within the merged image.
You can use a multimodal model such as CLIP to embed both the images and the text, and then retrieve the image whose vector is most similar to the query.
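
As a rough illustration of the bookkeeping both approaches need, here is a minimal sketch in plain Python. It is not part of the framework: the function names (`grid_boxes`, `image_index_at`, `best_match`) are hypothetical, the merged image is assumed to be a row-major grid of equally sized cells, the model is assumed to answer with a pixel point (x, y) in the merged image, and the short vectors stand in for real CLIP embeddings that you would compute separately.

```python
import math

def grid_boxes(n_images, cols, cell_w, cell_h):
    """Approach 1: (left, top, right, bottom) of each image in a row-major grid."""
    boxes = []
    for i in range(n_images):
        row, col = divmod(i, cols)
        left, top = col * cell_w, row * cell_h
        boxes.append((left, top, left + cell_w, top + cell_h))
    return boxes

def image_index_at(x, y, boxes):
    """Map a point returned by the model back to the index of the image containing it."""
    for i, (left, top, right, bottom) in enumerate(boxes):
        if left <= x < right and top <= y < bottom:
            return i
    return None  # the point lies outside the merged image

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(query_vec, image_vecs):
    """Approach 2: name of the image whose embedding is most similar to the query."""
    return max(image_vecs, key=lambda name: cosine(query_vec, image_vecs[name]))

# Approach 1: 20 images of 256x256 pixels laid out 5 per row (assumed layout).
boxes = grid_boxes(20, cols=5, cell_w=256, cell_h=256)
picked = image_index_at(600, 300, boxes)

# Approach 2: toy 3-dimensional vectors in place of real image/text embeddings.
image_vecs = {"cat.png": [0.9, 0.1, 0.0], "chart.png": [0.1, 0.8, 0.2]}
match = best_match([0.2, 0.9, 0.1], image_vecs)
```

Either way, the agent ends up with an index or file name, which it can then attach to its text answer. The embedding route avoids sending one very large image to the model, at the cost of precomputing and storing a vector per picture.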

LandJerry added the question (Further information is requested) label on Oct 23, 2024
