-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Internals: generators and backtracking
The jq language is a generator/backtracking-type functional language.
Expressions produce zero, one, or more results. Producing zero results
is the same backtracking, and as a result the empty
builtin triggers
backtracking.
When backtracking, the VM unwinds the stack, freeing garbage as it
goes, up to a saved point (a ”fork point“), restarting at that point.
The resumed instruction knows that a backtrack has occurred and can do a
variety of things as a result, such as: produce the next result in some
low-level operation (e.g., EACH
, for .[]
, INDEX
for .[expr]
).
Many built-in generators work by catching backtracking and producing the
next value until there are no more values to produce, in which case they
backtrack.
Fundamental generators in jq:
- the comma operator (
FORK
opcode) -
.[]
(iterate values of.
) (EACH
opcode) -
range(init;upto)
(RANGE
opcode) - the
path()
builtin (PATH_BEGIN
andPATH_END
opcodes) - the last expression, the one that outputs values to the calling C
program (
RET
opcode)
Generators can be implemented in terms of existing jq language
functionality. For example, the new range(init;upto;by)
builtin is a
jq-coded function, using only:
- the comma operator for generation
- tail-call optimized recursion for iteration
- sub-functions (yes, jq functions can define sub-functions!)
Programmers familiar with the internals of Schemes, LISPs, or more esoteric languages like Prolog or Icon, will probably understand easily how jq generators and backtracking work.
Programmers familiar with GCC's local function extension to the C language will also easily understand how jq generators and backtracking work.
For everyone else the key to understanding jq generators and backtracking is as follows:
- Calling a function pushes a new call frame on the jq interpreted stack, much as one would expect in a typical C implementation
- Returning from a function, however, is actually like a
yield
in Python and other such languages:- The RET instruction calls
frame_pop()
, which one would expect… - …but the
frame_pop()
function often does not pop anything from the stack, instead it changes the frame pointer to point to the previous frame. - The RET instruction then pushes a backtrack point onto the stack with
stack_push()
. The ”return“ v
- The RET instruction calls
The effect of this is that returning from a jq function leaves the callee's frame on the stack and pushes a backtrack point. Eventually the callee will be restarted by backtracking to its RET
instruction, which will restore the current frame pointer to be the callee's. If the caller being yielded to needs to push more things onto the stack, these get pushed past the callee's frame. At some point the callee will have no more values to output, in which case it will backtrack. Backtracking is more similar to returning in C than is yielding.
Note that frames are back-linked in a way that corresponds to the jq program's pipes. The compiler counts the nesting level (number of pipes to cross) to get to a symbol such as a variable reference, and records this as a “level” in the bytecode produced for, e.g., a LOADV
instruction. The interpreter then traverses that many frames to find the frame in which the binder for that symbol is to be found. Note that frames of callees are not in the caller's back-link list! This is how lexical binding works in jq then, and it is akin to deep binding in that it requires iteration to find symbols, with global-most symbol references requiring the most iteration to resolve. (The run-time could use something more like shallow binding, but the cost would be to copy, at frame-push time, all bound references that might be needed by the new frame's bytecode.)
A compilation to C with local functions would map the jq output/yield/return operation to a function call of a continuation, where a continuation is the function pointer of a local function that closes over local variables of its parent function. Backtracking would map to a normal C function return, or perhaps to calling a computed goto (another GCC extension to C).
Of course, a better compilation to low-level code would avoid using extended C local functions as those require a system call to permit execution of the stack page where the local function object is allocated. (The GCC local function implementation creates executable function objects on the stack in order to take their address; these are known as ”trampolines“. In principle C could have larger pointers for local function pointers such that calling such a function pointer does not require an executable trampoline on the stack, but in practice the type name for such function pointers would have to indicate that the pointer is a closure pointer, and C has not been extended to make this possible, not even by GCC, therefore trampolines are unavoidably necessary.) Instead a standard C function + callback argument could be used, or compilation to the LLVM IR could be used to bypass C altogether to produce even more optimized code.
- Home
- FAQ
- jq Language Description
- Cookbook
- Modules
- Parsing Expression Grammars
- Docs for Oniguruma Regular Expressions (RE.txt)
- Advanced Topics
- Guide for Contributors
- How To
- C API
- jq Internals
- Tips
- Development