Are context images really needed during few-shot inference? #278

zhang9302002 · 2023-11-07T10:17:42Z

Dear author,

I have recently been reproducing the few-shot image captioning task. I noticed that in the Flamingo and OpenFlamingo setting, each token can attend to only one previous image (or none). This means that in k-shot image captioning, a newly generated token can attend only to the query image, so the k context images cannot be accessed at all. Generation depends only on the query image, and each context (image, text) pair contributes only its text tokens and the '<image>' token, without any encoded visual information.
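To make the masking I am describing concrete, here is a minimal NumPy sketch. It assumes each token may cross-attend only to the most recent image at or before its position; the function name `last_image_mask` and the toy token layout are my own illustration, not code taken from the OpenFlamingo repository.

```python
import numpy as np

def last_image_mask(is_image_token):
    """Build a (num_tokens, num_images) boolean cross-attention mask.

    Assumption: a token may attend only to the most recent image that
    appears at or before its position; tokens before any image attend
    to nothing.
    """
    is_image_token = np.asarray(is_image_token, dtype=bool)
    # 1-based index of the latest image seen so far; 0 means no image yet.
    latest_image = np.cumsum(is_image_token)
    num_images = int(is_image_token.sum())
    image_ids = np.arange(1, num_images + 1)
    # Row t is True only in the column of token t's latest image.
    return latest_image[:, None] == image_ids[None, :]

# Toy 2-shot prompt: two (image, text) context pairs, then the query image.
# 1 marks an image placeholder position, 0 marks an ordinary text token.
layout = [1, 0, 0, 1, 0, 0, 1, 0]
mask = last_image_mask(layout)
print(mask[-1].tolist())  # [False, False, True]
```

Under this assumption, the row for the last token is True only in the query image's column, which is the behaviour I describe above.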

I ran some experiments and found that using (image, text) pairs as context and using text only as context give very similar CIDEr scores. Does this mean that Flamingo few-shot inference depends only on the k text-only context examples, rather than on the paired (image, text) context? Or have I missed some details?

Thank you :)
