
text and vision embedding does not match #28

Open
DrAlexLiu opened this issue Sep 24, 2024 · 0 comments
I am trying to train this model with both image and text conditioning. However, the two embedding paths do not produce matching shapes.

In craftsman, the image path is

`image_features = self.model.visual_projection(pooler_output)`

yet the vision outputs are not projected; their embedding dimension is 1024 (visual_embeds shape: `torch.Size([32, 4, 257, 1024])`).

In Transformers:
https://github.com/huggingface/transformers/blob/be9cf070ee2cb6a9f0d162e5be32d9d68b9df3af/src/transformers/models/clip/modeling_clip.py#L1503

image_embeds is projected, so its embedding dimension is 768.

`text_features = self.model.text_projection(pooler_output)`

But text_features is projected, so its embedding dimension is 768 (text_embeds shape: `torch.Size([32, 77, 768])`).

Eventually,

`return torch.cat([text_embeds, visual_embeds], dim=1)`

gives me a shape error, because the two tensors do not match:

visual_embeds shape: `torch.Size([32, 4, 257, 1024])`
text_embeds shape: `torch.Size([32, 77, 768])`
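
A minimal way to reproduce the failure, using random tensors with the shapes from my run (a standalone snippet, not the craftsman code itself):

```python
import torch

# Shapes taken from my run; random data is only used to reproduce the error.
visual_embeds = torch.randn(32, 4, 257, 1024)  # vision hidden states, unprojected (1024-dim)
text_embeds = torch.randn(32, 77, 768)         # text embeddings after text_projection (768-dim)

# Fails: the tensors differ both in number of dimensions (4 vs 3)
# and in the embedding dimension (1024 vs 768), so they cannot be concatenated on dim=1.
torch.cat([text_embeds, visual_embeds], dim=1)
```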

So eventually I cannot use the pretrained weights to fine-tune text-to-3D; I can only fine-tune image-to-3D.
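
Is the intended fix to apply the CLIP `visual_projection` to each vision token and flatten the view/token axes before concatenation? Below is a rough sketch of what I have in mind, assuming a CLIP ViT-L/14 backbone (1024-d vision hidden size, 768-d projection dim); I am not sure this is compatible with how the pretrained weights were trained, so please correct me if the conditioning is supposed to work differently:

```python
import torch
import torch.nn as nn

# Hypothetical standalone projection; in the real model this would be
# self.model.visual_projection loaded from the CLIP checkpoint (Linear(1024 -> 768, no bias)).
visual_projection = nn.Linear(1024, 768, bias=False)

visual_embeds = torch.randn(32, 4, 257, 1024)  # [batch, views, tokens, vision_hidden]
text_embeds = torch.randn(32, 77, 768)         # [batch, tokens, projection_dim]

# Project every vision token into the shared 768-d space,
# then merge the view and token axes into one sequence axis so both tensors are 3-D.
projected = visual_projection(visual_embeds)   # [32, 4, 257, 768]
projected = projected.flatten(1, 2)            # [32, 4 * 257, 768] = [32, 1028, 768]

fused = torch.cat([text_embeds, projected], dim=1)  # [32, 77 + 1028, 768]
print(fused.shape)
```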
