
distributed RPC-based speculative evaluation #1

Open
okuvshynov opened this issue May 19, 2024 · 1 comment
okuvshynov (Owner) commented:
Plan copied over from ggerganov/llama.cpp#6853 (reply in thread):

We might not even need to write much new code for this, I suppose. Since the models are separate, we can start (main_A + speculative) on instance_A and (main_B + speculative) on instance_B. Then we need to orchestrate the data/logic passing during the transition phase:

  • In the 'middle' of main-model processing (A is done with the first half), pass the activations to B, and pass whatever B has speculated so far back to A.
  • At the end of main-model processing (B is done with the logits), fetch B's latest speculation, consolidate it with what we have currently produced on A, pass the 'current approved tokens' to A, and restart speculation on B.
  • Repeat.
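The verification step in the loop above can be sketched as a toy simulation. This is a minimal sketch, not llama.cpp code: `draft_tokens` stands in for B's speculative model, `main_model_next` for the pipeline-split main model (whose activation handoff between A and B is elided here), and both are hypothetical deterministic stand-ins so the accept/reject logic is easy to follow.

```python
def draft_tokens(prefix, n):
    # Toy draft model on instance B: speculate the next n tokens.
    return [(prefix[-1] + i + 1) % 100 for i in range(n)]

def main_model_next(ctx):
    # Toy main model (in the real setup, split across A and B with an
    # activation handoff in the middle): the "correct" next token.
    return (ctx[-1] + 1) % 100

def run_round(approved, n_draft=4):
    # B speculates while A runs the main model over the approved prefix.
    draft = draft_tokens(approved, n_draft)
    ctx = list(approved)
    # A verifies the draft, accepting the longest matching prefix and
    # emitting one corrected token on the first mismatch (standard
    # speculative-decoding consolidation of the 'approved tokens').
    for t in draft:
        target = main_model_next(ctx)
        ctx.append(target)
        if t != target:
            break
    else:
        # Whole draft accepted; the main model still yields one more token.
        ctx.append(main_model_next(ctx))
    return ctx

tokens = [0]
for _ in range(3):          # "repeat"
    tokens = run_round(tokens)
print(tokens)
```

Because the toy draft and main models agree here, every round accepts all four drafted tokens plus one main-model token, so three rounds grow the sequence from 1 to 16 tokens; a real run would accept variable-length prefixes.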

Relevant links:

Devices I can test it on are:

  • M2 Ultra 192GB
  • M2 24GB
  • M1 16GB
@okuvshynov okuvshynov self-assigned this May 19, 2024
okuvshynov (Owner, Author) commented:

Work will be done here: https://github.com/okuvshynov/llama.cpp/tree/duo, as I'll need some changes to llama.cpp itself.
