
Questions Related to the Application and Results of Attention Sinks After the Paper #28

Closed
dsdanielpark opened this issue Nov 13, 2023 · 2 comments

Comments

@dsdanielpark


  1. Hello, I was deeply impressed by your paper. Since it resolves the issue of the initial tokens receiving a disproportionate amount of attention weight, I expected many models to adopt attention sinks. However, even after some time has passed, they do not seem to be applied as widely as I expected. May I ask what the authors think the reason might be?

  2. I am also curious whether it is better to apply attention sinks during training or during inference, and whether any performance degradation has been verified since the paper. Intuitively, I do not expect a significant overall speed improvement, but I wonder whether quality should not be slightly higher; giving more weight to the early parts of a sequence might even be a way to improve the model's overall understanding of it.

In short, I am curious how the authors' thinking has changed since the paper was published.

@tomaarsen
Owner

Hello!

First of all, I want to point out that I'm not one of the paper authors! The official GitHub repository for the paper is https://github.com/mit-han-lab/streaming-llm. Feel free to copy your issue over there!

However, I can try to answer your questions myself, as I'm quite familiar with this topic as well.

  1. I agree with you here - I expected this to be adopted by practitioners very quickly. My theory is that it's not commonly used because few people care about longer "fluency" - people mostly care about longer context lengths, which is not something that attention sinks provide. However, transformers is working on an implementation: Generate: New Cache abstraction and Attention Sinks support huggingface/transformers#26681

  2. I can't say for certain, as I've only applied it during model inference. My experiments show that inference is faster than full/dense attention once more tokens have been generated than the window size. However, there is a very slight loss in perplexity, which intuitively means a slight loss in understanding. The cache policy that produces this behaviour is sketched below.
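
To make point 2 concrete, here is a minimal sketch of the key/value cache policy that attention sinks imply, as described in the StreamingLLM paper: keep the KV entries of the first few tokens (the "sinks") plus a sliding window of the most recent tokens, and evict everything in between. The class and parameter names (`SinkKVCache`, `num_sink_tokens`, `window_size`) are illustrative only, not the actual `transformers` or `attention_sinks` API.

```python
from collections import deque

class SinkKVCache:
    """Illustrative eviction policy for attention sinks: always keep the
    first `num_sink_tokens` KV entries plus a sliding window of the most
    recent `window_size` entries; evict everything in between."""

    def __init__(self, num_sink_tokens: int = 4, window_size: int = 1020):
        self.num_sink_tokens = num_sink_tokens
        self.sinks = []                          # permanent "sink" entries
        self.recent = deque(maxlen=window_size)  # sliding window of recent entries

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sink_tokens:
            # The first few tokens become permanent attention sinks.
            self.sinks.append(kv_entry)
        else:
            # deque(maxlen=...) drops the oldest entry automatically,
            # which is exactly the eviction step.
            self.recent.append(kv_entry)

    def view(self):
        # What the model attends over at each decoding step:
        # the sink tokens plus the most recent window, nothing in between.
        return self.sinks + list(self.recent)

# Example: with 4 sink tokens and a window of 8, after 20 tokens the
# cache holds token ids 0-3 and 12-19.
cache = SinkKVCache(num_sink_tokens=4, window_size=8)
for token_id in range(20):
    cache.append(token_id)
print(cache.view())  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

Because the cache size is bounded by `num_sink_tokens + window_size`, the per-token cost stops growing once generation exceeds the window, which is where the speed advantage over full/dense attention comes from; the slight perplexity loss comes from the evicted middle tokens.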

  • Tom Aarsen

@dsdanielpark
Author

@tomaarsen

Thank you for your invaluable insights; they have been a great help.

I also plan to apply attention sinks in my own inference pipeline. The key question seems to be how information loss might differ between documents such as abstracts, where the opening tokens carry crucial information, and more standardized formats such as insurance policies, which simply follow the required template.

It seems clear that speed and performance involve some degree of trade-off.

Once again, thank you for the wonderful project and your opinions.
I'll come back with more questions if I have any.
