agent How to implement a rich content format with both text and images? #167

Open
wolkerzheng opened this issue Sep 24, 2024 · 2 comments
Labels
question Further information is requested

Comments

@wolkerzheng

I have more than 20 different pictures. How can I have the agent pick the appropriate picture and include it as part of its answer?

@moyanxinxu

As things stand, it doesn't look possible.

@AniviaTn
Collaborator

AniviaTn commented Sep 25, 2024

We don't currently have a ready-made solution, but you can achieve this with either of the following two approaches:

  1. Merge all the images into one large image and submit it to a multimodal model such as gpt-4o, prompting it to return the coordinates of the selected image within the merged image.
  2. Use a multimodal model such as CLIP to compute embeddings for the images and the text, then retrieve the image whose vector is most similar to the query.
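
As a rough illustration of the second approach, here is a minimal sketch of the retrieval step only. It assumes the image and query embeddings have already been computed by a model such as CLIP; the toy 3-dimensional vectors, the file paths, and the helper names (`pick_image`, `build_answer`) are all hypothetical and not part of this project's API. The chosen image is appended to the text answer as a Markdown image link, which is one possible way to return text and a picture together.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as no match
    return dot / (norm_a * norm_b)


def pick_image(query_embedding, image_embeddings, min_score=0.0):
    """Return (path, score) for the most similar image, or None if no
    image scores strictly above min_score."""
    best = None
    for path, embedding in image_embeddings.items():
        score = cosine_similarity(query_embedding, embedding)
        if score > min_score and (best is None or score > best[1]):
            best = (path, score)
    return best


def build_answer(text, match):
    """Append the selected image to the text answer as a Markdown image."""
    if match is None:
        return text
    path, _score = match
    return f"{text}\n\n![]({path})"


# Hypothetical data: toy vectors standing in for real CLIP embeddings.
image_embeddings = {
    "images/cat.png": [0.9, 0.1, 0.0],
    "images/dog.png": [0.1, 0.9, 0.0],
    "images/car.png": [0.0, 0.1, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # would come from the text encoder

match = pick_image(query_embedding, image_embeddings)
print(build_answer("Here is the picture you asked for.", match))
```

In a real setup the image embeddings would be computed once and cached, and `min_score` would need tuning so that the agent answers with text only when no picture is a good match.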

@LandJerry added the question (Further information is requested) label on Oct 23, 2024
4 participants