
distributed RPC-based speculative evaluation #1

Open
okuvshynov opened this issue May 19, 2024 · 1 comment
okuvshynov (Owner) commented:
Plan copied over from ggerganov/llama.cpp#6853 (reply in thread):

We might not even need to write much new code for this, I suppose. Since the models are separate, we can start (main_A + speculative) on instance_A and (main_B + speculative) on instance_B. Then we need to orchestrate the data/logic passing during the transition phase:

  • In the 'middle' of main-model processing (A is done with the first half), pass the activations to B, and pass whatever B has speculated so far back to A.
  • At the end of main-model processing (B is done with the logits), fetch B's latest speculation, consolidate it with what we have currently produced on A, pass the 'current approved tokens' to A, and restart speculation on B.
  • Repeat.
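The verification step in the loop above can be sketched as a toy simulation. This is a minimal sketch, not llama.cpp code: `draft_tokens` stands in for B's speculative model, `main_model_next` for the pipeline-split main model (whose activation handoff between A and B is elided here), and both are hypothetical deterministic stand-ins so the accept/reject logic is easy to follow.

```python
def draft_tokens(prefix, n):
    # Toy draft model on instance B: speculate the next n tokens.
    return [(prefix[-1] + i + 1) % 100 for i in range(n)]

def main_model_next(ctx):
    # Toy main model (in the real setup, split across A and B with an
    # activation handoff in the middle): the "correct" next token.
    return (ctx[-1] + 1) % 100

def run_round(approved, n_draft=4):
    # B speculates while A runs the main model over the approved prefix.
    draft = draft_tokens(approved, n_draft)
    ctx = list(approved)
    # A verifies the draft, accepting the longest matching prefix and
    # emitting one corrected token on the first mismatch (standard
    # speculative-decoding consolidation of the 'approved tokens').
    for t in draft:
        target = main_model_next(ctx)
        ctx.append(target)
        if t != target:
            break
    else:
        # Whole draft accepted; the main model still yields one more token.
        ctx.append(main_model_next(ctx))
    return ctx

tokens = [0]
for _ in range(3):          # "repeat"
    tokens = run_round(tokens)
print(tokens)
```

Because the toy draft and main models agree here, every round accepts all four drafted tokens plus one main-model token, so three rounds grow the sequence from 1 to 16 tokens; a real run would accept variable-length prefixes.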

Relevant links:

Devices I can test it on are:

  • M2 Ultra 192GB
  • M2 24GB
  • M1 16GB
@okuvshynov okuvshynov self-assigned this May 19, 2024
okuvshynov (Owner, Author) commented:

Work will be done here: https://github.com/okuvshynov/llama.cpp/tree/duo, as I'll need some changes to llama.cpp itself.
