Roadmap of YOLO-World #109

Open
8 of 16 tasks
wondervictor opened this issue Mar 7, 2024 · 20 comments
Labels: discussions (The issue might be helpful or contains useful information), enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@wondervictor
Collaborator

wondervictor commented Mar 7, 2024

This issue will be kept open and pinned for a long time, as we hope to hear everyone's opinions, suggestions, and needs!
We want to make YOLO-World stronger and encourage more diverse applications, especially practical ones, and we are open to all input. YOLO-World is under active development, and we are doing our best on both upstream pre-training and downstream deployment tools. Our manpower is currently limited, so we hope you can give us some time and contribute your experience or help when you can!

If you have a good idea or a feature request, just reply to this issue and @ me. I will respond promptly when I see it and consider adding it to the TODO list.

TODO List (Community Version)

🎯: High priority or on-going.

wondervictor added the enhancement, help wanted, and discussions labels on Mar 7, 2024
wondervictor pinned this issue on Mar 7, 2024
@taofuyu
Contributor

taofuyu commented Mar 8, 2024

torch.einsum() should be replaced by torch.matmul() and torch.sum(), because einsum() is not supported by most edge devices.
For example, I rewrote this code:
x = torch.einsum('bchw,bkc->bkhw', x, w)
as:
batch, channel, height, width = x.shape
_, k, _ = w.shape
x = x.permute(0, 2, 3, 1) # bchw->bhwc
x = x.reshape(batch, -1, channel) # bhwc->b(hw)c
w = w.permute(0, 2, 1) # bkc->bck
x = torch.matmul(x, w) # b(hw)c @ bck->b(hw)k
x = x.reshape(batch, height, width, k) # b(hw)k->bhwk
x = x.permute(0, 3, 1, 2) # bhwk->bkhw
It may be ugly, but it can be deployed.
@wondervictor
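
A quick way to check that the rewrite above matches the original einsum is to run both on random tensors and compare the results. The following is only a minimal sketch: the helper name einsum_free and the tensor sizes are assumptions chosen for illustration and do not come from the YOLO-World code base.

```python
import torch


def einsum_free(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Compute einsum('bchw,bkc->bkhw', x, w) with permute/reshape/matmul only."""
    batch, channel, height, width = x.shape
    k = w.shape[1]
    # bchw -> bhwc -> b(hw)c
    x = x.permute(0, 2, 3, 1).reshape(batch, height * width, channel)
    # b(hw)c @ bck -> b(hw)k
    out = torch.matmul(x, w.permute(0, 2, 1))
    # b(hw)k -> bhwk -> bkhw
    return out.reshape(batch, height, width, k).permute(0, 3, 1, 2)


torch.manual_seed(0)
x = torch.randn(2, 8, 5, 7)  # (batch, channel, height, width), arbitrary sizes
w = torch.randn(2, 3, 8)     # (batch, k, channel)

reference = torch.einsum('bchw,bkc->bkhw', x, w)
rewritten = einsum_free(x, w)

# The two results should agree up to floating-point rounding.
print(rewritten.shape, torch.allclose(reference, rewritten, atol=1e-5))
```

The comparison uses a small tolerance because the two formulations may accumulate floating-point sums in a different order.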

@wondervictor
Collaborator Author

@taofuyu Good idea, Got it!

@mio410

mio410 commented Mar 8, 2024

@wondervictor May I ask which part I should modify if I want to try other text encoders, for example replacing the CLIP text encoder with BEIT-3? Thank you!

@wondervictor
Collaborator Author

@mio410 Good idea. We do plan to use better and stronger text encoders (e.g., CLIP-Large), and we are queuing for computation resources to pre-train with them. BEIT-3 is a good choice and we are considering it. BTW, which model size do you need most at the moment? I can prioritize that.

@mio410

mio410 commented Mar 8, 2024

@wondervictor May I ask which part I should modify if I want to try other text encoders, for example replacing the CLIP text encoder with BEIT-3? Thank you!
Besides, I'd like to try a CLIP model trained on a different language to see whether I can use prompts in that language for open-vocabulary detection. Is this possible?

@mio410

mio410 commented Mar 8, 2024

> @mio410 Good idea. We do plan to use better and stronger text encoders (e.g., CLIP-Large), and we are queuing for computation resources to pre-train with them. BEIT-3 is a good choice and we are considering it. BTW, which model size do you need most at the moment? I can prioritize that.

I'm looking forward to your work! If possible, I'd like to try open vocabulary detection in other languages. Could you help me with that?

@taofuyu
Contributor

taofuyu commented Mar 8, 2024

> @wondervictor May I ask which part I should modify if I want to try other text encoders, for example replacing the CLIP text encoder with BEIT-3? Thank you!

here

wondervictor self-assigned this on Mar 8, 2024
@dikapiliao1

YOLO-World relies on CLIP word embeddings for re-parameterization. If we could replace CLIP with a larger model similar to ChatGPT4, would it understand more, similar to Sora's powerful ability to understand images?

@wondervictor
Collaborator Author

> YOLO-World relies on CLIP word embeddings for re-parameterization. If we could replace CLIP with a larger model similar to ChatGPT4, would it understand more, similar to Sora's powerful ability to understand images?

Hi @dikapiliao1, it's a nice idea and we plan to do it.

@xianhonghuang

If I want to switch to a different vision backbone, where can I change it?

@wondervictor
Collaborator Author

> If I want to switch to a different vision backbone, where can I change it?

@xianhonghuang Replace the image_model config according to your needs:

backbone=dict(
    _delete_=True,
    type='MultiModalYOLOBackbone',
    image_model={{_base_.model.backbone}},
    text_model=dict(
        type='HuggingCLIPLanguageBackbone',
        model_name=text_model_name,
        frozen_modules=['all'])),

@xianhonghuang

xianhonghuang commented Mar 28, 2024

> If I want to switch to a different backbone, where can I change it?

> @xianhonghuang Replace the image_model config according to your needs:
>
> backbone=dict(
>     _delete_=True,
>     type='MultiModalYOLOBackbone',
>     image_model={{_base_.model.backbone}},
>     text_model=dict(
>         type='HuggingCLIPLanguageBackbone',
>         model_name=text_model_name,
>         frozen_modules=['all'])),

Do you mean changing this part?
_base_ = ('../../third_party/mmyolo/configs/yolov8/'
'yolov8_l_syncbn_fast_8xb16-500e_coco.py')
I would like to switch to the YOLOv7 backbone first.

@wondervictor
Collaborator Author

Hi @xianhonghuang, you can directly override the backbone dictionary config, e.g., change it to YOLOv7Backbone. BTW, please open a new issue to discuss this question, since this issue is intended for new features and suggestions.
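
As a rough illustration of the override described in the reply above, the inherited image_model reference could be swapped for an explicit backbone dictionary. This is only a sketch under assumptions: the arch value and the text_model_name placeholder are illustrative and not taken from this thread, and the neck and head channel settings would likely also need to match the new backbone's outputs.

```python
# Hypothetical config fragment written as plain Python dicts.
# Values marked "assumed" are illustrative and not confirmed by the thread.
text_model_name = 'openai/clip-vit-base-patch32'  # assumed placeholder

backbone = dict(
    _delete_=True,
    type='MultiModalYOLOBackbone',
    # Explicit backbone dict instead of image_model={{_base_.model.backbone}}.
    image_model=dict(
        type='YOLOv7Backbone',  # backbone type named in the reply
        arch='L',               # assumed variant
    ),
    text_model=dict(
        type='HuggingCLIPLanguageBackbone',
        model_name=text_model_name,
        frozen_modules=['all']))

print(backbone['image_model'])
```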

@RudyCheng

The config yolo_world_v2_xl_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival.py does not match its model weights.

@wondervictor
Collaborator Author

@RudyCheng, it has been resolved.

@xiyuan27

xiyuan27 commented May 2, 2024

[Object detection on document images] Are there any specialized optimization strategies or support for object detection in vertical domains, specifically for document images such as invoices and passports?

@thgpddl

thgpddl commented Jun 24, 2024

May I ask why image_demo.py appends an extra blank string after splitting the input text on ","? For example, if text=cat,dog,man, then after the code processes it, text=cat,dog,man," "
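
The behavior described in the question above can be reproduced with a few lines of plain Python. This is a sketch of what the question reports, not a copy of image_demo.py: the helper name build_texts is hypothetical, the actual script may differ, and the thread does not state why the blank entry is appended.

```python
from typing import List


def build_texts(text: str) -> List[List[str]]:
    """Split a comma-separated prompt and append one blank entry at the end."""
    return [[t.strip()] for t in text.split(',')] + [[' ']]


texts = build_texts('cat,dog,man')
print(texts)  # [['cat'], ['dog'], ['man'], [' ']]
```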

@spacewalk01

Any progress on the TensorRT implementation? Thanks.

@PrinceP

PrinceP commented Aug 4, 2024

@myb1314yxy

I would like to ask how I should use YOLO-World to detect unknown classes in my own dataset. How should the dataset be split and prepared? Do I need to pre-define all known classes in the YAML file?
