Fix stack corruption caused by Fn call primitives #807
Excerpt from @munificent on the nature of the bug:
In runInterpreter(), for performance, the VM caches an IP pointing into some bytecode.
All primitives except .call leave Wren's own call stack untouched. They run a little C code and return, so the array of CallFrames, their IPs, and the IP cached inside run() are not affected at all.
While runInterpreter() is running, the IP in the top CallFrame is not updated, so it gets out of sync. This is deliberate, since storing to a field is slow, but it means the value of that field is stale and doesn't represent where execution actually is at that point in time.
To get that field in sync, we use STORE_FRAME(), which stores the local IP value back into the IP field for the top CallFrame. The interpreter is careful to always call STORE_FRAME() before executing any code that pushes a new CallFrame onto the stack.
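As a rough illustration of the cached-IP pattern described above, here is a minimal toy model in C. It is not Wren's actual code: ToyFrame, ToyFiber, toyPushFrame(), toyResumeOffset(), and the three opcodes are hypothetical names invented for this sketch, and the `storeFrame` flag stands in for whether the STORE_FRAME() step runs before a new frame is pushed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical, simplified stand-ins for Wren's CallFrame and fiber.
typedef struct {
  const uint8_t* code; // Start of this frame's bytecode.
  const uint8_t* ip;   // Saved instruction pointer (may be stale).
} ToyFrame;

typedef struct {
  ToyFrame frames[8];
  int numFrames;
} ToyFiber;

// Hypothetical opcodes for the toy interpreter.
enum { OP_NOP, OP_CALL, OP_RETURN };

// Pushes a new frame that starts executing at the beginning of `code`.
static void toyPushFrame(ToyFiber* fiber, const uint8_t* code) {
  assert(fiber->numFrames < 8);
  ToyFrame* frame = &fiber->frames[fiber->numFrames++];
  frame->code = code;
  frame->ip = code;
}

// Runs the caller up to its first OP_CALL (or OP_RETURN), then reports the
// offset saved in the caller's frame, i.e. where the caller would resume.
static ptrdiff_t toyResumeOffset(const uint8_t* callerCode,
                                 const uint8_t* calleeCode,
                                 int storeFrame) {
  ToyFiber fiber;
  fiber.numFrames = 0;
  toyPushFrame(&fiber, callerCode);

  ToyFrame* frame = &fiber.frames[0];
  const uint8_t* ip = frame->ip; // IP cached in a local for speed.

  for (;;) {
    uint8_t op = *ip++; // Only the local advances; frame->ip is untouched.
    if (op == OP_CALL) {
      if (storeFrame) frame->ip = ip; // The STORE_FRAME() step.
      toyPushFrame(&fiber, calleeCode);
      break;
    }
    if (op == OP_RETURN) break;
  }

  return fiber.frames[0].ip - fiber.frames[0].code;
}
```

With a caller of `{OP_NOP, OP_NOP, OP_CALL, OP_RETURN}`, the saved offset points just past the call when the store runs, and still points at the start of the function when it is skipped, which is the stale state described above.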
In particular, if you look around, you'll see that every place the interpreter calls wrenCallFunction() is preceded by a STORE_FRAME(). That is, except for the call to wrenCallFunction() in the call_fn() primitive. That's the bug.
The .call() method on Fn is special because it does modify the Wren call stack: the C code for that primitive directly calls wrenCallFunction(). When that happens, the correct IP for the current function, which lives only in runInterpreter()'s local variable, gets discarded and you're left with a stale IP in the CallFrame.
Giving the function call primitives a different method type and having the case for that method type call STORE_FRAME() before invoking the primitive fixes the bug.
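The shape of that fix can be sketched as a dispatch on method type. This is only an illustration under assumed names: DemoFrame, DemoMethodType, DEMO_METHOD_FN_CALL, demoInvoke(), and the other identifiers are hypothetical and are not taken from the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for a call frame holding a saved IP.
typedef struct {
  const uint8_t* ip;
} DemoFrame;

// Hypothetical method types: ordinary primitives versus the Fn call
// primitives that push a new frame.
typedef enum {
  DEMO_METHOD_PRIMITIVE,
  DEMO_METHOD_FN_CALL
} DemoMethodType;

typedef void (*DemoPrimitive)(DemoFrame* frame);

typedef struct {
  DemoMethodType type;
  DemoPrimitive fn;
} DemoMethod;

// Records the frame IP that a primitive sees when it runs.
static const uint8_t* demoObservedIp;

static void demoObserve(DemoFrame* frame) {
  demoObservedIp = frame->ip;
}

// Invokes `method`. `ip` is the interpreter's cached local IP.
static void demoInvoke(const DemoMethod* method, DemoFrame* frame,
                       const uint8_t* ip) {
  assert(method->fn != NULL);
  switch (method->type) {
    case DEMO_METHOD_PRIMITIVE:
      // Ordinary primitives never push frames, so skip the store.
      method->fn(frame);
      break;
    case DEMO_METHOD_FN_CALL:
      // Sync the frame first, as STORE_FRAME() would, because the
      // primitive is about to push a new frame.
      frame->ip = ip;
      method->fn(frame);
      break;
  }
}

// Returns 1 if a primitive of the given type sees the up-to-date IP.
static int demoSeesCurrentIp(DemoMethodType type) {
  static const uint8_t code[4] = {0, 0, 0, 0};
  DemoFrame frame = { code };   // Frame IP is stale (start of code).
  const uint8_t* ip = code + 2; // Cached local has advanced.
  DemoMethod method = { type, demoObserve };
  demoInvoke(&method, &frame, ip);
  return demoObservedIp == ip;
}
```

Keeping the store in a separate case means ordinary primitives still avoid the extra field write, while the call primitives always run against a synced frame.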
Benchmarks from this fix: