Questions Related to the Application and Results of Attention Sinks After the Paper
Hello, I was deeply impressed by your paper. I thought that many models would apply attention sinks since the issue with the initial token receiving a disproportionate amount of weight was resolved. However, it seems that even after some time has passed, they are not being applied as much as I expected. May I ask what the authors think might be the reason for this?
I am curious whether it is better to apply attention sinks during model training or during model inference, and whether any performance degradation has been verified since the paper. Intuitively, I do not expect a significant overall speed improvement, but I wonder whether performance should be slightly higher. Alternatively, it also seems intuitive that giving more weight to the early parts of a sentence could be a way to enhance the overall understanding of the sentence.
Therefore, I am curious about how the authors' thoughts have changed after the paper.
I can try to answer these questions myself, as I'm fairly knowledgeable in this area as well.
I agree with you here - I expected this to be adopted by practitioners very quickly. My theory is that it's not commonly used because not many people care about longer "fluency" - people mostly care about longer context lengths, which is not something that attention sinks provide. However, transformers is working on an implementation: Generate: New Cache abstraction and Attention Sinks support huggingface/transformers#26681
I can't say for certain, as I've only applied it during model inference. My experiments show that inference speed is higher than with full/dense attention once more tokens have been generated than the window size. However, there is a very slight loss in perplexity, which intuitively means a slight loss in understanding.
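To make the inference-time mechanism concrete, here is a minimal sketch of the cache-eviction policy attention sinks use: keep the first few "sink" tokens plus a sliding window of the most recent tokens, and evict everything in between. The function name and the default sizes (`num_sink=4`, `window=1020`, roughly matching the paper's 4+1020 setup) are illustrative choices, not an actual library API.

```python
def sink_keep_indices(seq_len: int, num_sink: int = 4, window: int = 1020) -> list[int]:
    """Indices of KV-cache entries retained under attention-sink eviction.

    Keeps the first `num_sink` tokens (the attention sinks) plus the most
    recent `window` tokens; all tokens in between are evicted. Until the
    sequence exceeds num_sink + window, nothing is evicted, so behavior
    matches dense attention up to that point.
    """
    if seq_len <= num_sink + window:
        return list(range(seq_len))
    return list(range(num_sink)) + list(range(seq_len - window, seq_len))
```

This also explains the speed observation above: with dense attention each new token attends to all previous tokens, so cost grows with sequence length, while with a sink cache the attended set is capped at `num_sink + window` entries regardless of how long generation runs. The slight perplexity loss comes from the evicted middle tokens no longer being attendable.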
Thank you for your invaluable insights. It's been a great help.
I also plan to apply attention sinks in my own inference process. The key point seems to be the potential difference in information loss between documents like abstracts, where the tokens at the beginning contain crucial information, and standardized documents like insurance policies, where the beginning merely satisfies a required format.
It seems clear that there is a certain degree of trade-off between speed and performance.
Once again, thank you for the wonderful project and your opinions.
I'll come back with more questions if I have any.