
text and vision embedding does not match #28

Open
DrAlexLiu opened this issue Sep 24, 2024 · 0 comments
I am trying to train this model with both image and text conditioning. However, the two embedding paths do not produce matching shapes.

In craftsman, the image path is

`image_features = self.model.visual_projection(pooler_output)`

yet the vision outputs are not projected; their embedding dimension is 1024 (visual_embeds shape: `torch.Size([32, 4, 257, 1024])`).

In Transformers:
https://github.com/huggingface/transformers/blob/be9cf070ee2cb6a9f0d162e5be32d9d68b9df3af/src/transformers/models/clip/modeling_clip.py#L1503

image_embeds is projected, so its embedding dimension is 768.

`text_features = self.model.text_projection(pooler_output)`

But text_features is projected, so its embedding dimension is 768 (text_embeds shape: `torch.Size([32, 77, 768])`).

Eventually,

`return torch.cat([text_embeds, visual_embeds], dim=1)`

gives me a shape error, because the two tensors do not match:

visual_embeds shape: `torch.Size([32, 4, 257, 1024])`
text_embeds shape: `torch.Size([32, 77, 768])`
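
A minimal way to reproduce the failure, using random tensors with the shapes from my run (a standalone snippet, not the craftsman code itself):

```python
import torch

# Shapes taken from my run; random data is only used to reproduce the error.
visual_embeds = torch.randn(32, 4, 257, 1024)  # vision hidden states, unprojected (1024-dim)
text_embeds = torch.randn(32, 77, 768)         # text embeddings after text_projection (768-dim)

# Fails: the tensors differ both in number of dimensions (4 vs 3)
# and in the embedding dimension (1024 vs 768), so they cannot be concatenated on dim=1.
torch.cat([text_embeds, visual_embeds], dim=1)
```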

So eventually I cannot use the pretrained weights to fine-tune text-to-3D; I can only fine-tune image-to-3D.
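
Is the intended fix to apply the CLIP `visual_projection` to each vision token and flatten the view/token axes before concatenation? Below is a rough sketch of what I have in mind, assuming a CLIP ViT-L/14 backbone (1024-d vision hidden size, 768-d projection dim); I am not sure this is compatible with how the pretrained weights were trained, so please correct me if the conditioning is supposed to work differently:

```python
import torch
import torch.nn as nn

# Hypothetical standalone projection; in the real model this would be
# self.model.visual_projection loaded from the CLIP checkpoint (Linear(1024 -> 768, no bias)).
visual_projection = nn.Linear(1024, 768, bias=False)

visual_embeds = torch.randn(32, 4, 257, 1024)  # [batch, views, tokens, vision_hidden]
text_embeds = torch.randn(32, 77, 768)         # [batch, tokens, projection_dim]

# Project every vision token into the shared 768-d space,
# then merge the view and token axes into one sequence axis so both tensors are 3-D.
projected = visual_projection(visual_embeds)   # [32, 4, 257, 768]
projected = projected.flatten(1, 2)            # [32, 4 * 257, 768] = [32, 1028, 768]

fused = torch.cat([text_embeds, projected], dim=1)  # [32, 77 + 1028, 768]
print(fused.shape)
```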
