I think this paper is excellent #2
Thanks a lot for your kind acknowledgment of our work and for sharing your thoughts! Regarding your questions/comments:
Thanks again for your comments, and please let us know if you have any questions :)
Thank you very much for your patient response. I think you did a great job. It would be even better to add a zero-shot evaluation, and the evaluation should include both category names and REC phrases to highlight the strengths of the LLM and the algorithm's generalization. The NIV metric in the paper does indeed have some reference value.
Thanks to the authors, and for open-sourcing the code as well.
First, this paper addresses two main problems:
Compared with the current trend of blindly flattening every modality into a token sequence and feeding it to an LLM for autoregressive decoding, the approach in this paper is clearly preferable (this is not to say the plain MLLM route can never work, only that it still has many problems at this stage).
LLMs are naturally suited to recognition; that is their strength. But without high-resolution input, region understanding or region extraction is probably not their strength, so the two sides need to complement each other. In real applications, quite a few people already feed CV recognition results back through an MLLM for concept correction, which is the same idea, except that this paper turns it into an end-to-end scheme (a rough two-stage sketch follows below).
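Just to make that two-stage variant concrete, here is a minimal Python sketch of the "detect first, then re-check each region with an MLLM" pattern; `run_detector` and `query_mllm` are hypothetical placeholders for whatever detector and multimodal LLM one actually uses, and this is not the paper's implementation, which couples the stages end-to-end.

```python
# Minimal sketch (not the paper's code) of the two-stage
# "CV detector + MLLM concept correction" pattern described above.
# run_detector and query_mllm are hypothetical placeholders, not real APIs.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates
    label: str                      # closed-vocabulary label from the detector
    score: float                    # detector confidence

def run_detector(image) -> List[Region]:
    """Hypothetical stand-in for any conventional detector / segmenter."""
    raise NotImplementedError

def query_mllm(crop, hint: str) -> str:
    """Hypothetical stand-in for asking a multimodal LLM to name the concept
    in the crop, optionally conditioned on the detector's label as a hint."""
    raise NotImplementedError

def detect_then_correct(image):
    """Two-stage variant: detect regions first, then let the MLLM refine each
    concept label. The paper instead trains this pipeline end-to-end."""
    results = []
    for region in run_detector(image):
        x1, y1, x2, y2 = region.box
        crop = image[y1:y2, x1:x2]  # assumes an HWC array-like image
        refined_label = query_mllm(crop, hint=region.label)
        results.append((region.box, refined_label, region.score))
    return results
```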
Compared with today's mainstream open-vocabulary detection (OVD), I think this scheme is a very good idea; conceptually it beats OVD hands down, because OVD has to build a vocabulary, which inherently limits its range of applications (though OVD still has its use cases). The algorithm in this paper can do both OVD and vocabulary-free concept recognition, and that is the part of the idea I like the most.
As for why the performance does not seem to open up a clear gap over OVD, there may be several reasons:
I think the third and fourth points matter most; especially in today's era of large models and large data, with enough data behind them the performance differences coming from the model architecture itself should be small.
Of course, the current scheme has to run the LLM once for every mask, so the inference cost is quite high (a rough cost illustration follows below).
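As a purely illustrative back-of-the-envelope (the numbers below are made up, not measured from this repo): if every mask needs its own LLM pass, the per-image latency grows linearly with the number of masks.

```python
# Purely illustrative: one independent LLM pass per mask means total latency
# scales linearly with the number of masks. Numbers below are made up.

def per_image_llm_time(num_masks: int, seconds_per_llm_pass: float) -> float:
    """Serial per-mask inference cost for a single image."""
    return num_masks * seconds_per_llm_pass

# e.g. a cluttered scene with 50 masks at ~0.3 s per LLM pass -> ~15 s per image,
# versus ~0.3 s if the same regions could somehow be folded into one pass.
print(per_image_llm_time(50, 0.3))  # 15.0
print(per_image_llm_time(1, 0.3))   # 0.3
```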
The above is only my personal, humble opinion; I have not yet read the code in detail, and I will update once I have a deeper understanding. Everyone is welcome to discuss here as well!
Thanks again to the authors.