perform AstGen on whole files at once (AST->ZIR) #8516

andrewrk · 2021-04-13T09:06:52Z

This is a language proposal as well as a concrete plan for how to implement it. It solves #335 and goes a long way towards making the problematic issue #3028 unneeded. The implementation plan simplifies the compiler and yet opens up straightforward opportunities for parallelism and caching.

In stage2 we have a concept of "AstGen" which stands for Abstract Syntax Tree Generation. This is the part where we input an AST and output Zig Intermediate Representation code.

Currently, this is done lazily as-needed per Decl (top level declaration). This requires code to orchestrate per-Decl ZIR code and independently manage memory lifetimes. It also means each Decl uses independent arrays of ZIR tags, instruction lists, string tables, and auxiliary lists. When a file is modified, the compiler checks which Decl source bytes differ, and repeats AstGen for the changed Decls to generate updated ZIR code.

One key design strategy is to make ZIR code immutable, typeless, and depend only on AST. This ensures that it can be re-used for multiple generic instantiations, comptime function calls, and inlined function calls.

This proposal takes that design strategy, and observes that it is possible to generate ZIR for an entire file indiscriminately, for all Decls, depending on AST alone and not introducing any type checking. Furthermore, it observes that this allows implementing the following compile errors:

* Unused private function
* Unused local variable
* Unused private global variable
* Unreachable code
* Local variable not mutated

All of these compile errors are possible with AstGen alone, and do not require types. In fact, trying to implement these compile errors with types is problematic because of conditional compilation. But there is no conditional compilation with AstGen. Doing entire files at once would make it possible to have compile errors for unused private functions and globals.
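As a rough analogy for why such checks need only syntax, here is a minimal sketch that flags unused local variables by walking an AST with no type information at all. It uses Python's own `ast` module rather than Zig's AST, and `unused_locals` is a hypothetical helper name; this is not how AstGen itself is implemented.

```python
import ast

def unused_locals(source):
    """Map each function name to locals that are assigned but never read."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        stored, loaded = [], set()
        for sub in ast.walk(node):
            if isinstance(sub, ast.Name):
                if isinstance(sub.ctx, ast.Store):
                    if sub.id not in stored:
                        stored.append(sub.id)
                elif isinstance(sub.ctx, ast.Load):
                    loaded.add(sub.id)
        # Purely syntactic: no types and no evaluation of any branch.
        unused = [name for name in stored if name not in loaded]
        if unused:
            result[node.name] = unused
    return result

example = """
def f(a):
    x = a + 1
    y = 2
    return x
"""
print(unused_locals(example))
```

Because the walk looks at every statement regardless of which branches would be taken, a check of this kind is not affected by conditional compilation.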

With the way that ZIR is encoded, emitting a whole file as one piece of ZIR code has less overhead than splitting it by Decl. Less list capacity is wasted, and more strings in the string table are shared.
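The string-sharing effect can be illustrated with a small sketch. The `StringTable` class, the null-terminated layout, and the example Decl names are all assumptions made for illustration, not the actual ZIR encoding.

```python
class StringTable:
    """Append-only byte table that stores each distinct string once."""
    def __init__(self):
        self.bytes = bytearray()
        self.offsets = {}  # string -> offset of its first byte

    def intern(self, s):
        if s not in self.offsets:
            self.offsets[s] = len(self.bytes)
            self.bytes += s.encode("utf-8") + b"\0"
        return self.offsets[s]

# Identifiers referenced by three hypothetical Decls in one file.
decls = {
    "init":   ["allocator", "self", "std"],
    "deinit": ["allocator", "self"],
    "append": ["allocator", "self", "item"],
}

# One table per Decl: repeated identifiers are stored once per Decl.
per_decl_total = 0
for names in decls.values():
    table = StringTable()
    for n in names:
        table.intern(n)
    per_decl_total += len(table.bytes)

# One table for the whole file: repeated identifiers are stored once.
whole_file = StringTable()
for names in decls.values():
    for n in names:
        whole_file.intern(n)

print(per_decl_total, len(whole_file.bytes))
```

In this toy input the per-Decl tables hold 54 bytes in total while the single file-level table holds 24.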

This works great for caching. All source files independently need to be converted to ZIR, and once converted to ZIR, the original source, token list, and AST node list are all no longer needed. The relevant bytes will be stored directly in ZIR. So each .zig source file will have exactly one corresponding ZIR bytecode. It's easy to imagine a caching strategy for this. Consider also that the transformation from .zig to ZIR does not depend on the target options, or on anything else other than the AST. So cached ZIR for std lib files and commonly used packages can be re-used between unrelated projects.
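One way to picture such a caching strategy is a table keyed by a hash of the source bytes. This is only a sketch under stated assumptions: `ZirCache` and `fake_astgen` are hypothetical names, the "ZIR" is placeholder bytes, and the choice of SHA-256 is mine rather than anything the proposal specifies.

```python
import hashlib

def fake_astgen(source):
    """Placeholder for tokenize -> parse -> AstGen; returns stand-in 'ZIR' bytes."""
    return b"ZIR:" + source.upper()

class ZirCache:
    def __init__(self):
        self.entries = {}  # hex digest of source -> ZIR bytes
        self.misses = 0

    def get(self, source):
        # The key depends only on the source bytes, never on target options,
        # so unrelated projects that contain the same file share one entry.
        key = hashlib.sha256(source).hexdigest()
        if key not in self.entries:
            self.misses += 1
            self.entries[key] = fake_astgen(source)
        return self.entries[key]

cache = ZirCache()
first = cache.get(b"const x = 1;")   # miss: runs the stand-in AstGen
second = cache.get(b"const x = 1;")  # hit: same bytes, same key
print(cache.misses)
```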

Furthermore, thanks to #2206, the compiler can optimistically look for all .zig source files in a project, and parallelize each tokenize->parse->ZIR transformation. The caching system can notice when .zig source files are unchanged, and load the ZIR code directly instead of the source, skipping tokenization, parsing, and AstGen entirely, on a per-file basis. The AST would only need to be loaded in order to report compile errors.
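The per-file parallelism described above might be sketched as follows, with a thread pool standing in for the compiler's worker threads. `astgen_file` and `build` are hypothetical names, and the string results are placeholders for real ZIR.

```python
from concurrent.futures import ThreadPoolExecutor

def astgen_file(item):
    """Stand-in for tokenize -> parse -> AstGen on one file."""
    path, source = item
    return path, "zir(%d bytes)" % len(source)

def build(files):
    # Phase 1: every file is independent, so the transformation can run in parallel.
    with ThreadPoolExecutor() as pool:
        zir_by_path = dict(pool.map(astgen_file, files.items()))
    # Leaving the `with` block waits for all submitted tasks to finish.
    # Phase 2: a later single-threaded pass consumes the results in a fixed order.
    return [zir_by_path[path] for path in sorted(zir_by_path)]

print(build({"b.zig": "abc", "a.zig": "x"}))
```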

Serialization of ZIR in binary form is straightforward. It consists only of:

* List of u8 tags for each instruction
* List of u32, u32 data for each instruction
* List of u8 string table
* List of u32 auxiliary data

Writing/reading this to/from a file is trivial.
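To make the shape of those four lists concrete, here is a minimal round-trip sketch. The header of three little-endian u32 counts and the ordering of the sections are assumptions made for this example; the issue does not specify an on-disk layout.

```python
import struct

def serialize(tags, data, strings, extra):
    assert len(tags) == len(data)  # one tag and one data pair per instruction
    # Hypothetical header: instruction count, string table length, extra length.
    header = struct.pack("<III", len(tags), len(strings), len(extra))
    flat = [word for pair in data for word in pair]
    return (header
            + bytes(tags)
            + struct.pack("<%dI" % len(flat), *flat)
            + bytes(strings)
            + struct.pack("<%dI" % len(extra), *extra))

def deserialize(blob):
    n_inst, n_str, n_extra = struct.unpack_from("<III", blob, 0)
    pos = struct.calcsize("<III")
    tags = list(blob[pos:pos + n_inst])
    pos += n_inst
    flat = struct.unpack_from("<%dI" % (2 * n_inst), blob, pos)
    pos += 8 * n_inst  # two u32 words per instruction
    data = [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]
    strings = bytes(blob[pos:pos + n_str])
    pos += n_str
    extra = list(struct.unpack_from("<%dI" % n_extra, blob, pos))
    return tags, data, strings, extra

blob = serialize([1, 2, 3], [(10, 20), (30, 40), (50, 60)], b"foo\0bar\0", [7, 8])
print(deserialize(blob))
```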

See #8516.

* AstGen is now done on whole files at once rather than per Decl.
* Introduce a new wait group for AstGen tasks. `performAllTheWork` waits for all AstGen tasks to be complete before doing Sema, single-threaded.
  - The C object compilation tasks are moved to be spawned after AstGen, since they only need to complete by the end of the function.

With this commit, the codebase compiles, but much more reworking is needed to get things back into a useful state.

zigazeljko · 2021-04-30T10:23:21Z

> All source files independently need to be converted to ZIR, and once converted to ZIR, the original source, token list, and AST node list are all no longer needed.

How is debug info handled in this case? ZIR describes locations in terms of node/token indices, so AST is still needed to obtain line and column numbers for DWARF info.

andrewrk · 2021-04-30T16:52:04Z

dbg_stmt ZIR instructions are emitted which indicate the beginning of statements, and contain line/column information. This was already true before whole-file-astgen. The only difference is that previously dbg_stmt used node indexes, which were resolved into line/column later, and now they are resolved to line/column in AstGen. This is more efficient because AstGen is where the source bytes are already loaded in memory for other reasons, such as looking at identifiers and string literals.
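Resolving a position to line/column while the source bytes are in memory could look roughly like this. `line_col` is a hypothetical helper and the 0-based numbering is an assumption; the thread does not say how dbg_stmt encodes its values.

```python
def line_col(source, offset):
    """Return the 0-based (line, column) of a byte offset in source."""
    # The line number is the count of newlines before the offset.
    line = source.count(b"\n", 0, offset)
    # The column is the distance from the start of that line.
    line_start = source.rfind(b"\n", 0, offset) + 1
    return line, offset - line_start

source = b"const a = 1;\nconst b = 2;\n"
print(line_col(source, 0))   # start of the first statement
print(line_col(source, 13))  # start of the second statement
```

Once the pair is stored in the instruction, later stages can emit debug info without reloading the source or the AST.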

I have not finished implementing all of the above in this branch yet.

andrewrk · 2021-05-19T01:07:24Z

This is nearly completed by #8554. All that is remaining is to implement the new compile errors.

andrewrk · 2021-05-19T01:10:25Z

Actually this is done now, since the compile errors are covered by (accepted) proposals #224 and #335.

andrewrk added this to the 0.8.0 milestone Apr 13, 2021

andrewrk mentioned this issue Apr 16, 2021: Stage2 whole file astgen #8554 (Merged)

andrewrk added the accepted This proposal is planned. label Apr 21, 2021

andrewrk mentioned this issue Apr 21, 2021: compile errors for unused things #335 (Open, 5 tasks)

andrewrk closed this as completed May 19, 2021
