We currently don't have a ready-made solution, but you can achieve your goal with either of the following two approaches:
You can merge all the images into one large image and submit it to a multimodal model such as gpt-4o, using a prompt that asks it to provide the coordinates of the selected image within the merged image.
You can use a multimodal model like CLIP to embed both the images and the text, and then retrieve results based on the similarity of the vectors.
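As a rough illustration of the second approach, here is a minimal sketch of the retrieval step only. It assumes the image and text embeddings have already been computed by a model such as CLIP; the three-dimensional vectors, the file names, and the helper names `cosine_similarity` and `retrieve_best_image` are made up for the example and are not part of any existing API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as no match
    return dot / (norm_a * norm_b)


def retrieve_best_image(query_embedding, image_embeddings, top_k=1):
    """Return the names of the top_k images most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), name)
        for name, embedding in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical, hand-written vectors standing in for real image embeddings.
image_embeddings = {
    "cat.png": [0.9, 0.1, 0.0],
    "invoice.png": [0.0, 0.2, 0.9],
    "chart.png": [0.1, 0.9, 0.2],
}

# Hypothetical embedding of the user's question, in the same vector space.
query_embedding = [0.8, 0.2, 0.1]

best = retrieve_best_image(query_embedding, image_embeddings)
print(best)
```

In an agent, the returned file name could then be attached to the answer. With only 20 or so pictures, a linear scan like this is likely sufficient, so a vector database is probably not needed.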
I have more than 20 different pictures. How can I have the agent return the appropriate picture as part of its answer?