
Why the query tokens at the backbone? #4

Open
IISCAditayTripathi opened this issue Mar 10, 2022 · 1 comment

Comments

@IISCAditayTripathi

In the DETR model, the query tokens are used only in the decoder; however, in ViDT the query tokens are also used at the backbone. What is the reason behind this, and what would happen if you used query tokens at the decoder only?

@songhwanjun (Collaborator) commented Apr 6, 2022

Yes, in DETR the query tokens are used only in the decoder. That is a reasonable design choice because there is an independent Transformer encoder between the backbone and the Transformer decoder. The encoder transforms the backbone features (originally learned for image classification) into a form more suitable for detection. (In detail, a classification model mainly focuses on the discriminative parts of the scene, such as the legs or the head of an object, whereas detection needs to see the whole area of the target object. Thus, the encoder is necessary for this feature transfer.)
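
For intuition, here is a minimal sketch of that DETR-style pipeline (the module names, dimensions, and head structure are illustrative assumptions, not DETR's actual implementation): the learned object queries enter only at the decoder, and a separate encoder sits between the backbone features and that decoder.

```python
import torch
import torch.nn as nn

class DetrStylePipeline(nn.Module):
    # Hypothetical minimal DETR-style head: object queries are used only in
    # the decoder, with an encoder in between to re-purpose the backbone
    # features (learned for classification) for detection.
    def __init__(self, dim=256, num_queries=100, num_classes=91):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=6)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=6)
        self.query_embed = nn.Embedding(num_queries, dim)   # object queries
        self.class_head = nn.Linear(dim, num_classes + 1)   # +1 for "no object"
        self.box_head = nn.Linear(dim, 4)

    def forward(self, backbone_tokens):                     # [B, HW, dim]
        memory = self.encoder(backbone_tokens)              # feature transfer
        queries = self.query_embed.weight.unsqueeze(0)
        queries = queries.expand(backbone_tokens.size(0), -1, -1)
        hs = self.decoder(queries, memory)                  # queries attend to memory
        return self.class_head(hs), self.box_head(hs).sigmoid()

tokens = torch.randn(2, 16 * 16, 256)                       # toy flattened feature map
logits, boxes = DetrStylePipeline()(tokens)
```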

However, by moving the object queries into the backbone, we can extract detection features directly from the Swin Transformer. The backbone itself is trained to be an object detector by adding the query tokens and fine-tuning.

If we use the query tokens only at the decoder, the performance drops significantly.
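
As a rough sketch of the alternative (the shapes, names, and block structure below are illustrative assumptions rather than ViDT's actual code), injecting learnable [DET] tokens into the backbone means every self-attention block refines the detection queries together with the image features during fine-tuning:

```python
import torch
import torch.nn as nn

class QueryTokensInBackbone(nn.Module):
    # Hypothetical sketch: learnable [DET] tokens are appended to the patch
    # tokens inside the backbone, so self-attention updates patch tokens and
    # query tokens jointly.
    def __init__(self, dim=96, num_det_tokens=100, depth=4):
        super().__init__()
        self.det_tokens = nn.Parameter(torch.zeros(1, num_det_tokens, dim))
        nn.init.trunc_normal_(self.det_tokens, std=0.02)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(depth))

    def forward(self, patch_tokens):                         # [B, N, dim]
        b = patch_tokens.size(0)
        x = torch.cat([patch_tokens, self.det_tokens.expand(b, -1, -1)], dim=1)
        for blk in self.blocks:                              # joint attention
            x = blk(x)
        num_det = self.det_tokens.size(1)
        return x[:, :-num_det], x[:, -num_det:]              # patch features, [DET] tokens

patches = torch.randn(2, 56 * 56, 96)                        # toy token grid
features, det_queries = QueryTokensInBackbone()(patches)
print(det_queries.shape)                                     # torch.Size([2, 100, 96])
```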

Best,
