This repository has been archived by the owner on Mar 1, 2024. It is now read-only.

How to recover text grounding from visual encoder #27

Open
zhouliang-yu opened this issue Feb 16, 2023 · 0 comments
Comments

@zhouliang-yu

Hey, thanks so much for sharing this repo!
Since R3M is trained via contrastive learning, it should have learned to align visual representations with text embeddings. Given that, I wonder whether there is an efficient way, when using R3M, to decode the textual grounding of a given visual representation.
One approach I can think of is to use a pre-trained captioning model to generate captions and then infer the description from them. What do you think of it?
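Another option, if one assumes (hypothetically) that R3M's visual and text embeddings live in a shared space as in CLIP-style contrastive models, would be zero-shot retrieval: rank a fixed list of candidate captions by cosine similarity to the visual embedding. A minimal sketch with random vectors standing in for actual model outputs (the helper `cosine_retrieve` and the toy captions are illustrative, not part of the R3M API):

```python
import numpy as np

def cosine_retrieve(visual_emb, text_embs, captions, k=3):
    """Rank candidate captions by cosine similarity to a visual embedding.

    Assumes visual and text embeddings share one space (a CLIP-style
    assumption; R3M's training objective may not guarantee this).
    """
    v = visual_emb / np.linalg.norm(visual_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ v  # cosine similarity of each caption to the image
    order = np.argsort(-sims)[:k]
    return [(captions[i], float(sims[i])) for i in order]

# Toy stand-ins for real encoder outputs (random vectors, not R3M features).
rng = np.random.default_rng(0)
captions = ["open the drawer", "pick up the mug", "close the door"]
text_embs = rng.normal(size=(3, 512))
visual_emb = text_embs[1] + 0.1 * rng.normal(size=512)  # near caption 1

print(cosine_retrieve(visual_emb, text_embs, captions, k=1)[0][0])
# → pick up the mug
```

This only works for a closed candidate set, so a captioning model would still be needed for open-ended descriptions.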
