Syntax flaw: Block statements and terminating semicolon #1677

Hejsil · 2018-10-21T20:57:48Z

Currently, block expressions (for, if, while) are only terminated with ;, if the ending block of that expression is not a block. This makes the grammar very hard (maybe even impossible) to make context-free. My best bet at formalizing a grammar for this have been something along these lines:

Statement 
    : IfStatement
    | Block
    ...
    | Expr Semicolon

IfStatement
    : "if" GroupedExpr option(Payload) Statement
    | "if" GroupedExpr option(Payload) Expr "else" option(Payload) Statement

IfExpr
    : "if" GroupedExpr option(Payload) Expr option("else" option(Payload) Expr)

Sadly, this grammar requires N token lookahead at best, and at worst it is ambiguous because a Block is also an Expr.

The text was updated successfully, but these errors were encountered:

Hejsil · 2018-10-21T21:00:57Z

Consider this example:

test "" {
    if (true)
        if (true)
            if (true)
                if (true)
                    if (true)
                        if (true)
                            return;
}

kyle-github · 2018-10-22T04:21:08Z

Personally I prefer to enforce braces around block bodies. Perl does this and skips a lot of pain. C-like languages generally allow you to skip the braces if there is only one statement in the body.

As Apple found out, it is not always a good idea to leave out the braces.

Hejsil · 2018-10-22T12:34:06Z

I think the solution to this is related to #760 #114 (this comment) and maybe even #1676.

We can change the syntax of the language so that syntactic constructs that will never pass semantic analysis will give a parse error. Consider this piece of code:

test "" {
    1 + 1; // error: expression value is ignored
}

No matter what the LHS and RHS are for this operator, this code will never pass semantic analysis, because + always returns an expression that can shouldn't be ignored. We can, therefore, split expressions into categories, where only some are valid as statements. This is the same solution as this. With a system like this, we can disallow the if expressions at the statement level and the ambiguity is solved.

@kyle-github We can then consider enforcing blocks, but it is not necessary to solve this issue (i think).

thejoshwolfe · 2018-10-22T12:37:20Z

I'm not an expert on formal language theory, so I don't understand the problem in this issue. Does the syntax flaw manifest in any way other than purely theoretical?

Hejsil · 2018-10-22T12:44:49Z

@thejoshwolfe Well, if we wanna have a formal grammar, it should be unambiguous, so that differently implemented parses don't vary in behavior. Otherwise, what is the point of having a grammar?

We could ask the same question for #760. The stage1 compiler takes a choice, and always follows that choice, so in implementation, there is no ambiguity. We still consider it a problem though, because ambiguities make code harder to reason about and read (even if the parser is consistent).

Hejsil · 2018-10-22T12:51:45Z

Also, it makes using parser generators to parse the Zig language trivial, which helps with specifying and testing the grammar itself. If we can have a grammar that can actually parse code by generating a parser, then, if some compilers vary in what syntax they parse, then you can point to the grammar and its generates parser and say "well, the parser that does the same as this generated one is correct" (This will help when new syntax have to be added to stage1 and stage2).

Hejsil · 2018-10-22T12:53:44Z

I've also heard, that different C++ compiler disagrees on what syntax is valid. I'd be nice if we could avoid such a mess. :)

Hejsil · 2018-10-22T13:02:35Z

3rdly! Having a grammar we can generate from allows from quick prototyping of syntax. We still have #114, and that is a big change. We should probably ensure we get it right, before rewriting a 2000 line parser :)

thejoshwolfe · 2018-10-23T00:49:19Z

Here are some excerpts from the grammar at the bottom of the langref docs:

Block = option(Symbol ":") "{" many(Statement) "}"

Statement = LocalVarDecl ";"
          | Defer(Block)
          | Defer(Expression) ";"
          | BlockExpression(Block)
          | Expression ";"
          | ";"

Defer(body) = ("defer" | "errdefer") body

BlockExpression(body) = Block
                      | IfExpression(body)
                      | IfErrorExpression(body)
                      | TestExpression(body)
                      | WhileExpression(body)
                      | ForExpression(body)
                      | SwitchExpression
                      | CompTimeExpression(body)
                      | SuspendExpression(body)

IfExpression(body) = "if" "(" Expression ")" body option("else" BlockExpression(body))

I'm not totally clear on what a context-free grammar is, but I'm assuming that this parametric (body) pattern doesn't qualify for that.

I don't think this is ambiguous. You're supposed to match the patterns in highest-to-lowest precedent as listed. So if Defer(Block) matches, then don't bother trying to match Defer(Expression) ";". Is this not good enough?

thejoshwolfe · 2018-10-23T01:06:34Z

We don't need the rules in #114 to be encoded in the grammar. Those rules can be enforced after parsing by examining the AST. I'm more optimistic about that approach generating more helpful error messages anyway. I understand that that will result in sloppy implementations failing to reject invalid Zig programs, but I think that's a minor concern compared to implementations failing to parse correct Zig programs.

thejoshwolfe · 2018-10-23T01:21:30Z

It sounds like you may be suggesting that an if at statement level should always require { braces } around the body/bodies. I count 112 violations of this already in the std lib:

$ grep -RP '^ *if .*[^{]$' std/zig/ | head
std/zig/parse.zig:        if (token_ptr.id == Token.Id.Eof) break;
std/zig/parse.zig:        if (shebang_tok_ptr.id != Token.Id.ShebangLine) break :shebang;
std/zig/parse.zig:        if (next_tok.id != Token.Id.LineComment) break;
std/zig/parse.zig:                    if (eatToken(&tok_it, &tree, Token.Id.Comma) == null)
std/zig/parse.zig:        if (next_tok.id != Token.Id.LineComment) return result;
std/zig/parse.zig:        if (prev_tok.id == Token.Id.LineComment) continue;
std/zig/ast.zig:            if (self.shebang) |shebang| return shebang;
std/zig/ast.zig:                if (i < 1) return type_node;
std/zig/ast.zig:                if (i < 1) return align_node;
std/zig/ast.zig:                if (i < 1) return init_node;

There are 0 violations of such a rule for for and while loops, but if is an important case to take seriously.

If we required braces on statement-level if, it would turn the 1-line if (something) return; idiom into a 3-line:

if (something) {
    return;
}

This isn't necessarily out of the question, but I don't want to go with this option just yet.

There are a few other cases where braces imply semicolons, but these all have 1-token lookahead, so I don't think anyone's too concerned about these:

comptime foo();
comptime {
    foo();
}

defer foo();
defer {
    foo();
}

suspend; // not sure about this one.
suspend {
    resume @handle();
}

if is the important case to think about.

tgschultz · 2018-10-23T03:21:22Z

There are 0 violations of such a rule for for and while loops,

That's not true, there are a couple. Here's one from std.mem.

pub fn set(comptime T: type, dest: []T, value: T) void {
    for (dest) |*d|
        d.* = value;
}

Hejsil · 2018-10-23T08:58:05Z

@thejoshwolfe

I'm not totally clear on what a context-free grammar is, but I'm assuming that this parametric (body) pattern doesn't qualify for that.

Correct. I think we should stick to BNF for representing our grammar. No EBNF or other variations, as they are less supported in parser generators (and all variations are supersets, so if we keep it simple, we save everyone a lot of pain).

I'm currently using bison + flex to validate my grammar work, and bison can report when there are conflicts (places where bison had to take a choice between two actions). These conflicts tend to be ambiguities, and bison reported one for the block expressions.

This issue is my theory as to why bison had problems. I'm not 100% sure my theory is correct, as bison reports conflicts in the state machine and not in the grammar (which makes it hard to figure out the exact problem). Bison generates a LARL(1) parser so I thought this was a lookahead issue.

I'll look more into this later, when I can work with the grammar again. #1676 is also causing conflicts and those might be interfering with other rules.

There are 0 violations of such a rule for for and while loops, but if is an important case to take seriously.

If we required braces on statement-level if, it would turn the 1-line if (something) return; idiom into a 3-line:

I don't think we have to give up the shorthand versions of the block expressions to resolve this issue.

Also, we have the dangeling else problem :)

Statement = IfStatement | Block | Expr ;
IfStatement = if ( Expr ) Block | if ( Expr ) Expr else IfStatement
Expr = 1 | 1 + Expr | IfExpr | Block
IfExpr = if ( Expr ) Expr | if ( Expr ) Expr else Expr
Block = { Statement }

./cfg-checker grammar.cfg
Found a sentential form with two different derivations:

  if ( Expr ) if ( Expr ) Expr else Expr ;

Derivation 1:

  0: Statement
  1: Expr ;
  2: IfExpr ;
  3: if ( Expr ) Expr else Expr ;
  4: if ( Expr ) IfExpr else Expr ;
  5: if ( Expr ) if ( Expr ) Expr else Expr ;

Derivation 2:

  0: Statement
  1: Expr ;
  2: IfExpr ;
  3: if ( Expr ) Expr ;
  4: if ( Expr ) IfExpr ;
  5: if ( Expr ) if ( Expr ) Expr else Expr ;

Hejsil · 2018-10-23T09:01:27Z

I don't think this is ambiguous. You're supposed to match the patterns in highest-to-lowest precedent as listed. So if Defer(Block) matches, then don't bother trying to match Defer(Expression) ";". Is this not good enough?

Grammars are ambiguous as long as there is one input that has two ways of expanding the grammar. Some parsing techniques do not have the concept of "highest-to-lowest", so they will parse grammars differently from parsers that do if the grammar is ambiguous (LARL parsers expand the grammar into states, where a state can handle N rules at the same time.)

EDIT: The Defer(Block) | Defer(Expression) ";" is not ambiguous, but it does not encode the rule correctly. Both of these examples are valid in our grammar, but stage1 rejects the second defer:

defer {}
defer {}; // error: invalid token: ';'

Hejsil · 2018-10-24T13:03:42Z

More fun grammar that requires N token lookahead:

async<if (true) A else B> fn()void {}
async<comptime A> fn()void {}

thejoshwolfe · 2018-10-24T15:31:40Z

@Hejsil is your work on the flex/bison parser online anywhere? That sounds like it might be a good starting point for overhauling the grammar specification.

Hejsil · 2018-10-24T16:41:47Z

@thejoshwolfe I've pushed my work to here. It can parse most of Zigs std (except async calls and a few statements). Currently, I'm working on identifying what syntactic constructs bison emits conflict warnings for, and seeing if I can restructure the grammar to get rid of them.

Edit: The grammar I'm working on reverts the changes made in #1628 and encodes the rules for #1047 (which makes the grammar simpler)

ghost · 2018-10-27T20:14:57Z

test "" {
    if (true)
        if (true)
            if (true)
                if (true)
                    if (true)
                        if (true)
                            return;
}

I don't see how this requires N token lookahead. I think lookahead refers to reading extra tokens to decide what to do with the already read tokens, for example you must read more before you reduce "a + b". Lookahead will always be required in some cases, so I don't see what's so bad about lookahead.

When you've seen if (expr) { } (in the general case, I know nothing about Zig) then you can immediately reduce this to say an IfStmt. It does not have to look at anything after '}'. The same is true for ';' in the code above. No other token needs to be looked at to collapse the whole stack of if stmts. No token after the semicolon could possibly alter how this is parsed.

Hejsil · 2018-10-27T20:51:49Z

@UniqueID1
N token lookahead (lookahead that scales with input) can be a problem for some parsing stratigies.

Anyways, the real problem in this issue is this, and how to specify Zigs syntax rules in grammar:

EDIT: The Defer(Block) | Defer(Expression) ";" is not ambiguous, but it does not encode the rule correctly. Both of these examples are valid in our grammar, but stage1 rejects the second defer:
defer {}
defer {}; // error: invalid token: ';'

I have a solution to this, but it required changing when ifs can be expressions.

Hejsil · 2018-11-13T13:09:44Z

Fixed: #1685

andrewrk added this to the 0.5.0 milestone Oct 23, 2018

Hejsil mentioned this issue Oct 27, 2018

[WIP] New Zig formal grammar #1685

Merged

Hejsil closed this as completed Nov 13, 2018

andrewrk modified the milestones: 0.5.0, 0.4.0 Apr 8, 2019

momumi mentioned this issue Jan 26, 2020

Always require brackets {} when using if/else/when/for etc. #4294

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Syntax flaw: Block statements and terminating semicolon #1677

Syntax flaw: Block statements and terminating semicolon #1677

Hejsil commented Oct 21, 2018 •

edited

Loading

Hejsil commented Oct 21, 2018

kyle-github commented Oct 22, 2018

Hejsil commented Oct 22, 2018

thejoshwolfe commented Oct 22, 2018

Hejsil commented Oct 22, 2018

Hejsil commented Oct 22, 2018

Hejsil commented Oct 22, 2018

Hejsil commented Oct 22, 2018

thejoshwolfe commented Oct 23, 2018 •

edited

Loading

thejoshwolfe commented Oct 23, 2018

thejoshwolfe commented Oct 23, 2018

tgschultz commented Oct 23, 2018

Hejsil commented Oct 23, 2018

Hejsil commented Oct 23, 2018 •

edited

Loading

Hejsil commented Oct 24, 2018 •

edited

Loading

thejoshwolfe commented Oct 24, 2018

Hejsil commented Oct 24, 2018 •

edited

Loading

ghost commented Oct 27, 2018

Hejsil commented Oct 27, 2018 •

edited

Loading

Hejsil commented Nov 13, 2018

Syntax flaw: Block statements and terminating semicolon #1677

Syntax flaw: Block statements and terminating semicolon #1677

Comments

Hejsil commented Oct 21, 2018 • edited Loading

Hejsil commented Oct 21, 2018

kyle-github commented Oct 22, 2018

Hejsil commented Oct 22, 2018

thejoshwolfe commented Oct 22, 2018

Hejsil commented Oct 22, 2018

Hejsil commented Oct 22, 2018

Hejsil commented Oct 22, 2018

Hejsil commented Oct 22, 2018

thejoshwolfe commented Oct 23, 2018 • edited Loading

thejoshwolfe commented Oct 23, 2018

thejoshwolfe commented Oct 23, 2018

tgschultz commented Oct 23, 2018

Hejsil commented Oct 23, 2018

Hejsil commented Oct 23, 2018 • edited Loading

Hejsil commented Oct 24, 2018 • edited Loading

thejoshwolfe commented Oct 24, 2018

Hejsil commented Oct 24, 2018 • edited Loading

ghost commented Oct 27, 2018

Hejsil commented Oct 27, 2018 • edited Loading

Hejsil commented Nov 13, 2018

Hejsil commented Oct 21, 2018 •

edited

Loading

thejoshwolfe commented Oct 23, 2018 •

edited

Loading

Hejsil commented Oct 23, 2018 •

edited

Loading

Hejsil commented Oct 24, 2018 •

edited

Loading

Hejsil commented Oct 24, 2018 •

edited

Loading

Hejsil commented Oct 27, 2018 •

edited

Loading