CoT-decoding (alternative method to Greedy) #9620

enthermo · 2024-09-24T05:41:37Z

enthermo
Sep 24, 2024

As mentioned in this paper, CoT-decoding explores alternative top-k tokens for higher confidence answers, resulting in measurable and significant performance gains and longer chain-of-thought reasoning from even non-instruct LLMs.

A post on reddit detailed their own implementation of CoT-decoding in Python (source code). Using Qwen 2.5 0.5B they achieved a 41% improvement in the GSM8K (22.82 before, 32.37 after).

All models tested in the paper, both base and instruct-tuned, showed significant improvement in accuracy and reasoning with unprompted and prompted zero-shot problems.

qnixsynapse · 2024-09-24T06:32:34Z

qnixsynapse
Sep 24, 2024

Very interesting concept. Subscribing myself to this discussion for checking out later. 🙂

0 replies

steampunque · 2024-09-25T14:18:17Z

steampunque
Sep 25, 2024

All models tested in the paper, both base and instruct-tuned, showed significant improvement in accuracy and reasoning with unprompted and prompted zero-shot problems.

This looks like a poor mans beam search which just continues with one (arbitrary) single beam after some prescribed number of tokens >=1 have been decoded. The paper seems over complicated for such a simple concept that can be described in one sentence. Not that I don't like the idea, its a very good idea to add a parameter to a beam search to halt full beam after N decoded tokens. Alternatives such as dropping to K beams K<N or even staging an arbitrary programmable sequence (N0 beams for first token, N1<=N0 for second, N2 <=N1 for third, etc.) could easily be explored and might give some decode benefits while reducing complexity of a full beam search.

Based on my experience I agree with the comments in the paper that the early tokens can make a large disproportionate difference on the continuation quality. I havent done extensive testing with beam search but the limited tests I have run show it can sometimes give correct answers to problems that will not be answered correctly with single beam greedy decode with only 2 to 3 beams.

Simply healing the last prompt token before generation will also give a guaranteed completion probability equal or greater than the non healed last prompt token decode. This can also make the difference between a right and wrong answer showing how important the early decode tokens are due to the autoregressive feedback mechanism in the decoder.

5 replies

enthermo Sep 25, 2024
Author

Aside from your comment of it being a continuation of one arbitrary beam, I don't disagree with that assessment. I feel that this is a potential improvement with the guaranteed trade-off of computational complexity, and the only reason I propose its implementation is the prospect of improving preexisting models is in my opinion appealing. If I knew the ins and outs of llama.cpp I would attempt my own implementation but I don't, nor do I know C++.

Also you're right, this paper is filled with fluff, far too much of it.

steampunque Sep 25, 2024

Aside from your comment of it being a continuation of one arbitrary beam, I don't disagree with that assessment. I feel that this is a potential improvement with the guaranteed trade-off of computational complexity, and the only reason I propose its implementation is the prospect of improving preexisting models is in my opinion appealing. If I knew the ins and outs of llama.cpp I would attempt my own implementation but I don't, nor do I know C++.

Also you're right, this paper is filled with fluff, far too much of it.

Thank you for starting this discussion. The subject area is very interesting. It also confirmed some things I had noticed working with LLMs for awhile now about how important early token gens are and I think there are many possible ways to approach concentrating on optimizing early part of decode (i.e. full N beam search for some prescribed number of token gens, then pick highest probability beam and continue with that single best beam for rest of decode is the approach that I think would be quite interesting, reasonably simple and potentially give some nice payout to improve performance of the decodes without paying the xN computational price for a full beam search.).

enthermo Sep 26, 2024
Author

Tested the Python code from my original post on the new Llama 3.2 3B Instruct and Gemma-2-9B-it. Llama was capable of completing basic math and logic problems and did so in a CoT format with zero prompting (only the initial question), although CoT decoding seems to do little help with reinforcing accurate knowledge. Gemma-2-9B-it was rather rigid and the additional computational complexity greatly slowed generation time.

Of course these are all observations with zero benchmarking, but it seems this decoding method would primarily benefit low parameter, high token/s models with basic logic.

steampunque Sep 26, 2024

Of course these are all observations with zero benchmarking, but it seems this decoding method would primarily benefit low parameter, high token/s models with basic logic.

The complexity multiplier for K top token path evaluation (K arbitrary, selection of final correct decode path among the K decoded paths using a heuristic) is approximately K, under the assumption each path has similar decode length. I see zero benefit to this idea over a standard length K beam search which is guaranteed to find the decode path with maximized cumulative probability over the K beams with the same computational overhead. Its just a highly suboptimal K path decode splitting at the first decode. Its neither optimal nor efficient. They also rely on Q: {prompt} to trigger the model into a reasoning path which is only slightly more efficient that simply stating to "use step by step reasoning" (near zero difference in prompt processing overhead for even a modestly long question). I honestly don't see anything useful here outside of the simple observation that greedy decode often won't take a CoT path for Q: {prompt} input.

enthermo Sep 26, 2024
Author

I believe what you're referring to was tested in the paper.

bennmann · 2024-10-21T00:26:54Z

bennmann
Oct 21, 2024

I am also interested in how CoT-decoding may influence creative writing, aside from also wanting it implemented for reasoning tasks. Upvoted.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CoT-decoding (alternative method to Greedy) #9620

{{title}}

Replies: 3 comments 5 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

CoT-decoding (alternative method to Greedy) #9620

enthermo Sep 24, 2024

Replies: 3 comments · 5 replies

qnixsynapse Sep 24, 2024

steampunque Sep 25, 2024

enthermo Sep 25, 2024 Author

steampunque Sep 25, 2024

enthermo Sep 26, 2024 Author

steampunque Sep 26, 2024

enthermo Sep 26, 2024 Author

bennmann Oct 21, 2024

enthermo
Sep 24, 2024

Replies: 3 comments 5 replies

qnixsynapse
Sep 24, 2024

steampunque
Sep 25, 2024

enthermo Sep 25, 2024
Author

enthermo Sep 26, 2024
Author

enthermo Sep 26, 2024
Author

bennmann
Oct 21, 2024