This repository has been archived by the owner on Apr 8, 2024. It is now read-only.

Lexical specification? #3

Open
ehuss opened this issue Oct 11, 2018 · 17 comments
Labels
lexer Issues with the lexer

Comments

@ehuss
Contributor

ehuss commented Oct 11, 2018

Should the group produce a separate lexical specification?

I think it would be useful, but I have no idea what it would look like.

There are some complexities, like the token splitting (<< to < <) that rustc performs, that I don't know how would be expressed in a formal grammar. Also, weak keywords. AIUI, raw strings also introduce complexities.

I would like to hear if others think this would be useful, how it would work, what the complications might be, etc.

Also relevant: https://internals.rust-lang.org/t/pre-pre-rfc-canonical-lexer-specification/4099

@CAD97
Contributor

CAD97 commented Oct 12, 2018

I believe the plan is to reuse the proc_macro lexer. Interestingly, it doesn't handle token splitting via splitting but rather via joining: Punct represents a single symbol, but reports its Spacing -- whether it's followed directly by the next Punct in the stream.

@Centril
Contributor

Centril commented Oct 17, 2018

Reusing the proc_macro lexer as a start seems good, but I think we should also eventually produce a lexical spec for completeness. This will be necessary for the reference and for any spec.

@eddyb
Member

eddyb commented Oct 31, 2018

You can bypass the issues with a lexical specification by using a scannerless grammar. Sadly, any correct and complete Rust lexical grammar or scannerless grammar is naturally context-sensitive.

@eternaleye has previously suggested using RCG ("range concatenation grammars") for that purpose and has even given an example of a raw string grammar (sadly I don't have a link on hand).

We also have plans for the gll crate to support some rewriting of certain forms of context-sensitivity into equivalent (but more verbose) context-free grammars, see rust-lang/gll#14 (comment).

@matklad
Member

matklad commented Sep 5, 2019

I think librustc_lexer has reached the state where it makes sense to provide a specification for it (i.e., it's not rustc-specific anymore). I think doing the actual spec work is the only thing that's left here; there are no other blockers. I wouldn't do the spec work myself, but I'll be happy to share my thoughts about it 😜

What to spec?

The minimal lexer API is &str -> Result<Vec<Token>, ()>, where Token is defined as follows:

struct Token {
    kind: TokenKind, // field-less enum
    len: usize,      // len > 0;
                     // the sum of all lengths equals the input text length;
                     // each partial sum is a char boundary
}

That is, a sequence of Unicode code points is either broken up into segments, or it contains an (unspecified) lexer error. The fact that the encoding is UTF-8 is the spec layer below; the fact that {} are matched is the layer above. Error recovery is not specified. Upon encountering an error, we return an Err(()).
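
As a small sketch (check_output is a made-up name for this example, assuming the Token struct above), these invariants can be stated directly as assertions over the output:

// Sketch only: validates the invariants stated above for a lexer's output.
fn check_output(text: &str, tokens: &[Token]) {
    let mut offset = 0;
    for token in tokens {
        assert!(token.len > 0); // len > 0
        offset += token.len;
        // each partial sum of lengths falls on a char boundary
        assert!(text.is_char_boundary(offset));
    }
    // the lengths sum to the input text length
    assert_eq!(offset, text.len());
}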

What are the tokens?

  • WHITESPACE is a separate TokenKind.
  • There are no dedicated keyword tokens, only IDENTIFIER.
  • There are no composite punctuation tokens: << is LT, LT and < < is LT, WS, LT.
  • Literal suffixes are either part of a literal or a separate IDENTIFIER token (the latter makes more sense; the former is how the proc_macro API looks).
  • We probably shouldn't explicitly spec the #!/usr/bin/runrust shebang token, and should leave it implementation-defined instead.
  • For string and character literals, I suggest we don't specify escape sequences as part of the lexer specification and only spec the boundaries of literals (so only \\ and \" need to be explicitly supported). Escaping should be specified, but in the spec one layer above the lexer.
  • IDENTIFIER should be just (_|XID_Start)XID_Continue*, and should be revised once the non-ASCII identifiers story is clearer.
  • Line endings should be assumed to be \n. Translating \r\n -> \n is the layer below the lexer.
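
For illustration only, a sketch of what the field-less TokenKind enum implied by this scheme might look like (the variant names are made up here; the actual set would be fixed by the spec):

enum TokenKind {
    Whitespace,
    LineComment,
    BlockComment,
    Ident,         // also covers keywords and literal suffixes
    Lifetime,
    IntLiteral,    // only the boundaries; escapes and suffixes are specced one layer up
    FloatLiteral,
    CharLiteral,
    StrLiteral,
    RawStrLiteral,
    Semi, Comma, Lt, Gt, Eq, // ... one variant per single-character punctuator
    Unknown,
}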

Some more words on split vs join: because the lexer produces whitespace tokens, the split model comes for free. Really, the join model in rustc is wrong, and we should just fix it: rust-lang/rust#63689

Tests

Each test is a pair of text files: an .rs file with the input text, and a .reference file with some serialization of Result<Vec<Token>, ()> (JSON probably being the least controversial).

It's important to provide a minimal test suite. That is, using code-coverage tools, provide a small set of test cases that hits all the branches of rustc_lexer. Just dumping all of src/test/ui wouldn't be the maximally useful test suite!

Declarative impl

It should be pretty easy to provide a &str -> Result<Vec<Token>, ()> API on top of rustc_lexer: just discard each token's payload, and make sure to produce Err(()) appropriately. However, I think it is valuable to provide a declarative implementation as well.

For this, I suggest using the regex crate and specifying the lexer as a sequence of regular expressions, with ties explicitly specified. That is, for any two regexes in the spec, no word should belong to one regex's language and be a prefix of a word in the other's, unless there's an explicit precedence between the two regexes (see how LALRPOP does this). Additionally, there's an r#*" regex which, when matched, calls into a special sub-routine for matching raw strings.
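
As a rough illustration of that scheme (the Rule struct, its field names, and the longest-match-then-precedence policy below are a sketch of one possible reading, not something this comment pins down):

use regex::Regex;

struct Rule {
    kind: &'static str,
    re: Regex,      // each rule anchored with \A so it matches at the current position
    precedence: u8, // explicit tie-break when two rules match the same length
}

// One way to operationalize "ties explicitly specified": take the longest
// match, and break exact ties by the rule's declared precedence.
fn next_token(rules: &[Rule], rest: &str) -> Result<(&'static str, usize), ()> {
    rules
        .iter()
        .filter_map(|r| r.re.find(rest).map(|m| (r, m.end())))
        .max_by_key(|&(r, len)| (len, r.precedence))
        .map(|(r, len)| (r.kind, len))
        .ok_or(()) // no rule matched: lexer error
}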

Organization

It would be best if the tests, the declarative impl, and a harness that runs rustc_lexer and the declarative impl against each other lived in a rust-lang/rust/src/librustc_lexer/tests/specification.rs file and were executed as part of rustc's CI.

Future plans

It would be sweet to replace hand-crafted rustc_lexer with a generated state machine!

@CAD97
Contributor

CAD97 commented Apr 4, 2020

The number token is an interesting case, because it has a lookahead quality to it, to prevent being greedy. It was mentioned that it may have made sense to break integer literals into parts with a joint flag, the way punctuation is; perhaps it would make sense for the lexer specification to do the same? (Unfortunately, that means it would be different from the TokenStream::parse implementation due to backwards compatibility 🙁)

Anyway, here's an informal transcription of rustc_lexer's current implementation into PCRE regex rules (with the (?Axu) modifiers: anchored, extended, unicode), including partially formed or malformed tokens [regex101]:

(*UTF8)(?Axus)(?(DEFINE)
  # begin unicode monkeypatch
  (?P<Pattern_White_Space> [\t\n\v\f\r\x20\x85\x{200e}\x{200f}\x{2028}\x{2029}])
  (?P<XID_Start> [_\p{L}\p{Nl}]) # approximate
  (?P<XID_Continue> [\pL\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}]) # approximate
  # end unicode monkeypatch

  # begin shared internals
  (?P<Single_Quoted_String>
    (?(?=.') .'
    | (?:(?!['/]|\n[^']|\z) (?>\\\\|.))* '?
    )
  )
  (?P<Double_Quoted_String> (?>\\[\\"]|[^"]) "? )
  (?P<Raw_Double_Quoted_String> (?P<hashes>\#*) .*? (?:(?P=hashes)"|\z) )
  # end shared internals

  (?P<LineComment> // [^\n]* )
  (?P<BlockComment>
    /\*
    (?P<_BlockComment_Inner>
      (?: (?!(?:/\*|\*/)) . )*
      (?:(?P>BlockComment) (?P>_BlockComment_Inner))?
    )
    (?:\*/|\z)
  )
  (?P<Whitespace> (?P>Pattern_White_Space)+ )

  (?P<Ident> (?P>XID_Start) (?P>XID_Continue)* )
  (?P<RawIdent> r\# (?P>Ident) )

  (?P<Literal_Int>
    (?: (?:0[bo])? [0-9]+ | 0x[_0-9a-fA-F]* )
    (?! \. (?! \.|(?P>XID_Start) ) [_0-9]* )
    (?! [eE] )
    (?P>Ident)?
  )
  (?P<Literal_Float>
    (?: (?:0[bo])? [0-9]+ | 0x[_0-9a-fA-F]* )
    (?: \. (?! \.|(?P>XID_Start) ) )?
    (?: [eE] [+-]? [_0-9]* )?
    (?P>Ident)?
  )
  (?P<Literal_Char>
    '
    (?(?=.'|(?!(?P>XID_Start)|[0-9]))
      (?(?=[^\\]')
        . '
      | (?P>Single_Quoted_String)
      )
      (?P>Ident)?
    | . (?P>XID_Continue)* '
    )
  )
  (?P<Literal_Byte> b' (?P>Single_Quoted_String) (?P>Ident)?)
  (?P<Literal_Str> " (?P>Double_Quoted_String) (?P>Ident)?)
  (?P<Literal_ByteStr> b" (?P>Double_Quoted_String) (?P>Ident)? )
  (?P<Literal_RawStr> r (?:[#"]) (?P>Raw_Double_Quoted_String) (?P>Ident)? )
  (?P<Literal_RawByteStr> br (?:[#"]) (?P>Raw_Double_Quoted_String) (?P>Ident)? )

  (?P<Lifetime> ' (?:(?P>XID_Start)|[0-9]) (?P>XID_Continue)* (?!') )

  (?P<Semi> ;)
  (?P<Comma> ,)
  (?P<Dot> \.)
  (?P<OpenParen> \()
  (?P<CloseParen> \))
  (?P<OpenBrace> \{)
  (?P<CloseBrace> \})
  (?P<OpenBracket> \[)
  (?P<CloseBracket> \])
  (?P<At> @)
  (?P<Pound> \#)
  (?P<Tilde> ~)
  (?P<Question> \?)
  (?P<Colon> :)
  (?P<Dollar> \$)
  (?P<Eq> =)
  (?P<Not> !)
  (?P<Lt> <)
  (?P<Gt> >)
  (?P<Minus> -)
  (?P<And> &)
  (?P<Or> \|)
  (?P<Plus> \+)
  (?P<Star> \*)
  (?P<Slash> /)
  (?P<Caret> \^)
  (?P<Percent> %)

  (?P<Unknown> .)
)

# And the actual lexer
(?> (?P<line_comment>(?P>LineComment))
  | (?P<block_comment>(?P>BlockComment))
  | (?P<whitespace>(?P>Whitespace))
  | (?P<ident>(?P>Ident))
  | (?P<raw_ident>(?P>RawIdent))
  | (?P<literal_int>(?P>Literal_Int))
  | (?P<literal_float>(?P>Literal_Float))
  | (?P<literal_char>(?P>Literal_Char))
  | (?P<literal_str>(?P>Literal_Str))
  | (?P<literal_bytestr>(?P>Literal_ByteStr))
  | (?P<literal_rawstr>(?P>Literal_RawStr))
  | (?P<literal_rawbytestr>(?P>Literal_RawByteStr))
  | (?P<lifetime>(?P>Lifetime))
  | (?P<semi>(?P>Semi))
  | (?P<comma>(?P>Comma))
  | (?P<openparen>(?P>OpenParen))
  | (?P<closeparen>(?P>CloseParen))
  | (?P<openbrace>(?P>OpenBrace))
  | (?P<closebrace>(?P>CloseBrace))
  | (?P<openbracket>(?P>OpenBracket))
  | (?P<closebracket>(?P>CloseBracket))
  | (?P<at>(?P>At))
  | (?P<pound>(?P>Pound))
  | (?P<tilde>(?P>Tilde))
  | (?P<question>(?P>Question))
  | (?P<colon>(?P>Colon))
  | (?P<dollar>(?P>Dollar))
  | (?P<eq>(?P>Eq))
  | (?P<not>(?P>Not))
  | (?P<lt>(?P>Lt))
  | (?P<gt>(?P>Gt))
  | (?P<minus>(?P>Minus))
  | (?P<and>(?P>And))
  | (?P<or>(?P>Or))
  | (?P<plus>(?P>Plus))
  | (?P<star>(?P>Star))
  | (?P<slash>(?P>Slash))
  | (?P<caret>(?P>Caret))
  | (?P<percent>(?P>Percent))
  | (?P<unknown>(?P>Unknown))
  )

(I am sorry, because this is a monster of a regex. I have put every effort into porting this correctly, but unfortunately, because it's a stupidly complex regex, I can't guarantee that it's correct.)

@eddyb
Member

eddyb commented Apr 4, 2020

@CAD97 Btw if you want to test that, the regex crate's RegexSet can be used for lexing.
That is, it tells you which regex cases matched, so you could use that to compare with rustc_lexer.
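
For what it's worth, a tiny sketch of that idea (the patterns here are illustrative stand-ins, not the real rules):

use regex::RegexSet;

fn main() {
    let set = RegexSet::new([
        r"\A[0-9]+",                 // 0: integer
        r"\A[A-Za-z_][A-Za-z0-9_]*", // 1: identifier (ASCII-only stand-in)
        r"\A\s+",                    // 2: whitespace
    ])
    .unwrap();

    // RegexSet reports which of the alternatives match; with \A anchors this
    // tells you which token classes are possible at the current position.
    let matched: Vec<usize> = set.matches("foo123").iter().collect();
    println!("{matched:?}"); // [1] -- classify, then compare with rustc_lexer
}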

@CAD97
Contributor

CAD97 commented Apr 4, 2020

@eddyb well, that uses PCRE features, so the regex crate (rightly) refuses to compile it. Regex101 supports PCRE regexes, and if you click on the link, you'll see the test case of rustc_lexer there, being lexed correctly (except for a missing call to the close brace pattern... whoops).

@eddyb
Member

eddyb commented Apr 4, 2020

Oh, I see, you're using named captures (?P<line_comment>), I wasn't paying enough attention.
So I guess my advice is just very general instead, whoops.

@CAD97
Contributor

CAD97 commented Apr 4, 2020

(Named captures aren't the issue; regex supports those. It's recursion ((?P>name)), lookahead conditionals ((?(?=)|)), and atomic groups ((?>)) that limit it to PCRE.)

With that rustc_lexer port out of the way, here's the real reason I've been playing with regex:

I have a much more reasonable regex to propose that can be used as the lexical specification, as described by @matklad. This regex is a regular regex and is accepted by the regex crate.

It does make some simplifying assumptions over rustc_lexer (which is why I made the full monster regex):

  • Does not, in any way, attempt to handle invalid tokens (for later recovery). In fact, as a specification, it shouldn't accept incorrect tokens at all. (This includes validating escapes in char/string literals. This is, somewhat surprisingly, easier than trying to have sane behavior for bad char/string literals.)
  • Lumps all punctuation into one token kind. Honestly, this is just for shortening the regex.
  • Does not handle string or integer suffixes. Instead this should be handled by token cooking at the next step up, treating a string token followed directly by an identifier token with no intervening tokens as a suffixed string literal.
  • Does not handle float literals. Instead this should be handled by token cooking at the next step up, combining an integer token followed by the other parts of a float literal into a float literal. (This is done to avoid the lookahead behavior of float tokens, where they may or may not consume a period depending on what follows it.)
(?x)
  (?P<line_comment> // [^\n]* )
| (?P<block_comment> /\* ) # parse_block_comment
| (?P<whitespace> \p{Pattern_White_Space}+ )
| (?P<ident> (?:r\#)? \p{XID_Start} \p{XID_Continue}* )
| (?P<lifetime> ' \p{XID_Start} \p{XID_Continue}* )
| (?P<binary_int> 0b [01]+ )
| (?P<octal_int> 0o [0-7]+ )
| (?P<decimal_int> [0-9]+ )
| (?P<hexadecimal_int> 0x [0-9a-fA-F]+ )
| (?P<character> ' (?: [^\\'] | \\ ['"] | \\ [nrt\\0] | \\ x [0-7][0-9a-fA-F] | \\ u \{[0-9a-fA-F]{1,6}} ) ' )
| (?P<string> " (?: [^\\"] | \\ ['"] | \\ [nrt\\0] | \\ x [0-7][0-9a-fA-F] | \\ u \{[0-9a-fA-F]{1,6}} )* " )
| (?P<raw_string>  r \#* " ) # parse_raw_string
| (?P<byte> b ' (?: [^\\'] | \\ ['"] | \\ [nrt\\0] | \\ x [0-9a-fA-F]{2} ) ' )
| (?P<byte_string> b " (?: [^\\"] | \\ ['"] | \\ [nrt\\0] | \\ x [0-9a-fA-F]{2} ) " )
| (?P<raw_byte_string> b r \#* " ) # parse_raw_string
| (?P<punct> [;,.(){}\[\]@\#~?:$=!<>\-&|+*/^%])

Further notes on this regex to transform it into a lexical specification:

  • All regex patterns are anchored, and match the first token of the document. To lex an entire document, repeatedly match the first token of the remaining unlexed document.
  • These arms never overlap, except for the following two cases:
    • A strict prefix of binary_int/octal_int/hexadecimal_int overlaps with decimal_int. Prefer the former, longer match when both match.
    • A strict prefix of character overlaps with lifetime. Prefer the former, longer match when both match.
  • block_comment is nonregular, so it relies on an external matcher to finalize the token. The matcher is specified as the following Rust function, which takes the string starting at the beginning of the match and returns the length of the resulting token; a panic indicates a malformed token:
    pub fn parse_block_comment(s: &str) -> usize {
        let mut chars = s.chars().peekable();
        assert_eq!(chars.next(), Some('/'));
        assert_eq!(chars.next(), Some('*'));
    
        let mut depth = 1usize;
        let mut len = 2;
    
        while depth > 0 {
            match chars.next() {
                Some('/') if matches!(chars.peek(), Some('*')) => {
                    chars.next();
                    depth += 1;
                    len += 2;
                }
                Some('*') if matches!(chars.peek(), Some('/')) => {
                    chars.next();
                    depth -= 1;
                    len += 2;
                }
                Some(c) => len += c.len_utf8(),
                None => panic!("exhausted source in block comment"),
            }
        }
        
        return len;
    }
  • raw_string and raw_byte_string are nonregular, so they rely on the same kind of external matcher to finalize the token, specified as the following Rust function, which takes the string starting at the beginning of the match and returns the length of the resulting token; a panic indicates a malformed token:
    pub fn parse_raw_string(s: &str) -> usize {
        // Position of the opening quote; any hashes sit between the `r`/`br`
        // prefix and this quote.
        let quote = s.find('"').unwrap();
        let hash_count = s[..quote].bytes().rev().take_while(|&b| b == b'#').count();

        // The raw string ends at the first `"` followed by the same number of hashes.
        let string_end = "\"".to_string() + &"#".repeat(hash_count);

        s[quote + 1..]
            .find(&string_end)
            .map(|end| quote + 1 + end + string_end.len())
            .unwrap_or_else(|| panic!("exhausted source in raw string"))
    }
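
To make the intended driving loop concrete, here is a minimal sketch under a few assumptions: spec is the rule alternation above compiled by the regex crate with a leading \A so matches anchor at the start of the remaining input, parse_block_comment and parse_raw_string are the helper functions above, and the overlap/tie rules from the notes are handled elsewhere (this sketch just takes whatever the engine reports):

use regex::Regex;

fn lex(source: &str, spec: &Regex) -> Result<Vec<(String, usize)>, ()> {
    let mut tokens = Vec::new();
    let mut rest = source;
    while !rest.is_empty() {
        // No rule matched at the current position: lexer error.
        let caps = spec.captures(rest).ok_or(())?;
        // Exactly one named group matched; find out which one.
        let (kind, m) = spec
            .capture_names()
            .flatten()
            .find_map(|name| caps.name(name).map(|m| (name, m)))
            .ok_or(())?;
        // Nonregular tokens hand off to the external matchers for their length.
        let len = match kind {
            "block_comment" => parse_block_comment(rest),
            "raw_string" | "raw_byte_string" => parse_raw_string(rest),
            _ => m.end(),
        };
        tokens.push((kind.to_string(), len));
        rest = &rest[len..];
    }
    Ok(tokens)
}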

(I would write the specification function matchers in pseudocode, LaTeX algorithm/algorithmic style, but we all speak Rust here.)

If this looks appropriate as the lowest level lexer specification, I can and am willing to do the work to pull librustc_lexer down to this level and re-cook this lexer's output into what librustc_parser needs, as well as provide a logos and/or regex driven generated version of librustc_lexer.

The obvious downside is that this would degrade errors 🙁, because the specification lexer should reject all malformed tokens, while a useful lexer should accept a superset of well-formed tokens and just make sure to error on malformed tokens. (For example, single-quote strings and digit lifetimes are accepted by rustc_lexer but error at a later stage, yet are (correctly) rejected by the specification lexer above.) So, in the same way that I doubt the specification grammar will ever be directly used in rustc, I think the best bet would be to maintain two lexers side by side: the specification lexer, generated from the specification regex, and the "friendly compiler" lexer, and just maintain a decent test suite and fuzz to make sure that they agree.

(Edit: I haven't actually specifically tested what characters are allowed in a character literal, and the spec above allows anything other than \ or ', which is wrong, because at least a literal newline is rejected in the existing grammar.)

Edit 2: this isn't quite perfect. What breaks it? stringify!(0b). stringify! is a macro that accepts an arbitrary token stream, and that call causes rustc to error for a malformed integer literal, whereas under this lexical spec it is accepted (as the integer 0 followed by the identifier b). This is opposed to, say, stringify!(1b), which is accepted.

So who's wrong here? rustc or the proposed spec? Or should this be an error in the "lexer cooking" phase? (What was the behavior pre-librustc_lexer?) The fact that rustc is the spec means rustc is correct, but consistency might suggest that this should be accepted.

@matklad
Member

matklad commented Apr 4, 2020

Does not, in any way, attempt to handle invalid tokens (for later recovery). In fact, as a specification, it shouldn't accept incorrect tokens at all.
The obvious downside is that this would degrade errors 🙁

This is very much a benefit, and I'd even say a requirement for the spec lexer. The property that we want to check is:

forall text: &str.
  match (speck_lexer(text), rustc_lexer(text)) {
    (Ok(speck_tokens), Ok(rustc_tokens)) => speck_tokens == rustc_tokens,
    (Err(()), Err(_)) => true,
    (Ok(_), Err(_)) | (Err(()), Ok(_)) => false
  }

So who's wrong here?

Curious, do we have tests for this in rust-lang/rust? If not, yay for speck finding uncovered shady areas!

@CAD97
Contributor

CAD97 commented Apr 4, 2020

There are some tests for this corner case: https://github.com/rust-lang/rust/blob/master/src/test/ui/parser/lex-bad-numeric-literals.rs

But the problem is: all of these cases should error, but that's because they're invalid code, not necessarily because they're invalid tokens.

Digging into this further, I've set up a simple proc macro to let us see the proc_macro::TokenStream generated:

// lib.proc-macro = true
use proc_macro::TokenStream;

#[proc_macro]
pub fn tokenize(tokens: TokenStream) -> TokenStream {
    panic!("TOKENS = {:#?}", tokens);
}

The fact that this panics is very important, as apparently rustc emits what look like lexer errors for bad number literals (e.g. 0b), but continues on to call any proc macros. Doing this for the numeric literals in the file in question:

tokenize!(
    0o1.0
    0o2f32
    0o3.0f32
    0o4e4
    0o5.0e5
    0o6e6f32
    0o7.0e7f64
    0x8.0e+9
    0x9.0e-9
    0o
    1e+
    0x539.0
    9900000000000000000000000000999999999999999999999999999999
    9900000000000000000000000000999999999999999999999999999999
    0x
    0xu32
    0ou32
    0bu32
    0b
    0o123f64
    0o123.456
    0b101f64
    0b111.101
);
error: octal float literal is not supported (x7)
 4 |     0o1.0
   |     ^^^^^
 6 |     0o3.0f32
   |     ^^^^^
 7 |     0o4e4
   |     ^^^^^
 8 |     0o5.0e5
   |     ^^^^^^^
 9 |     0o6e6f32
   |     ^^^^^
10 |     0o7.0e7f64
   |     ^^^^^^^
24 |     0o123.456
   |     ^^^^^^^^^

error: hexadecimal float literal is not supported (x3)
11 |     0x8.0e+9
   |     ^^^^^^^^
12 |     0x9.0e-9
   |     ^^^^^^^^
15 |     0x539.0
   |     ^^^^^^^

error: no valid digits found for number (x6)
13 |     0o
   |     ^^
18 |     0x
   |     ^^
19 |     0xu32
   |     ^^
20 |     0ou32
   |     ^^
21 |     0bu32
   |     ^^
22 |     0b
   |     ^^

error: expected at least one digit in exponent
14 |     1e+
   |     ^^^

error: binary float literal is not supported
26 |     0b111.101
   |     ^^^^^^^^^

error: proc macro panicked
   = help: message: TOKENS = TokenStream [
               Literal { lit: Lit { kind: Float, symbol: "0o1.0", suffix: None }, .. },
               Literal { lit: Lit { kind: Integer, symbol: "0o2", suffix: Some("f32") }, .. },
               Literal { lit: Lit { kind: Float, symbol: "0o3.0", suffix: Some("f32") }, .. },
               Literal { lit: Lit { kind: Float, symbol: "0o4e4", suffix: None }, .. },
               Literal { lit: Lit { kind: Float, symbol: "0o5.0e5", suffix: None }, .. },
               Literal { lit: Lit { kind: Float, symbol: "0o6e6", suffix: Some("f32") }, .. },
               Literal { lit: Lit { kind: Float, symbol: "0o7.0e7", suffix: Some("f64") }, .. },
               Literal { lit: Lit { kind: Float, symbol: "0x8.0e+9", suffix: None }, .. },
               Literal { lit: Lit { kind: Float, symbol: "0x9.0e-9", suffix: None }, .. },
               Literal { lit: Lit { kind: Integer, symbol: "0", suffix: None }, .. },
               Literal { lit: Lit { kind: Float, symbol: "1e+", suffix: None }, .. },
               Literal { lit: Lit { kind: Float, symbol: "0x539.0", suffix: None }, .. },
               Literal { lit: Lit { kind: Integer, symbol: "9900000000000000000000000000999999999999999999999999999999", suffix: None }, .. },
               Literal { lit: Lit { kind: Integer, symbol: "9900000000000000000000000000999999999999999999999999999999", suffix: None }, .. },
               Literal { lit: Lit { kind: Integer, symbol: "0", suffix: None }, .. },
               Literal { lit: Lit { kind: Integer, symbol: "0", suffix: Some("u32") }, .. },
               Literal { lit: Lit { kind: Integer, symbol: "0", suffix: Some("u32") }, .. },
               Literal { lit: Lit { kind: Integer, symbol: "0", suffix: Some("u32") }, .. },
               Literal { lit: Lit { kind: Integer, symbol: "0", suffix: None }, .. },
               Literal { lit: Lit { kind: Integer, symbol: "0o123", suffix: Some("f64") }, .. },
               Literal { lit: Lit { kind: Float, symbol: "0o123.456", suffix: None }, .. },
               Literal { lit: Lit { kind: Integer, symbol: "0b101", suffix: Some("f64") }, .. },
               Literal { lit: Lit { kind: Float, symbol: "0b111.101", suffix: None }, .. },
           ]

error: aborting due to 19 previous errors

(spans snipped, errors manually compressed)

So there are a few philosophical questions here:

  • Can we change this lex? After all, it currently always emits an error, so compilation fails, even if proc macros are run.
  • Is it correct to error here? After all, once the proc macro has run, the token stream is clean.

And a fun note: put "0o0.0".parse::<TokenStream>().unwrap(); in your proc macro somewhere. The parse will succeed (no panic) but you'll get an error diagnostic for octal float literal is not supported. This almost certainly should not be what happens: either this should be an error creating the TokenStream, and the macro should panic, or this should be a no-op. "0o0.0".parse::<TokenStream>();, which doesn't even assert the validity of the parse, almost certainly should be a no-op.

Now I think any lexer specification is going to need three parts:

  • A regex-formalized micro lexer (plus context-free and/or procedural definitions for block comments and raw strings), similar to my above regex set,
  • A specification for the proc_macro::TokenStream data structure, and
  • A translation/cooking routine from the micro lexer's output to the TokenStream data structure (see the sketch after this list).
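
To make the cooking step a bit more concrete, here is a rough, hypothetical sketch of just the suffix-gluing piece (RawToken, CookedToken, and the variant names are all made up for illustration; the real routine would also handle floats, doc comments, and so on):

#[derive(Clone, Copy, PartialEq)]
enum RawKind { Int, Str, Ident, Whitespace }

struct RawToken { kind: RawKind, len: usize }

enum CookedToken {
    Literal { len: usize, suffix_len: usize },
    Ident { len: usize },
    Whitespace { len: usize },
}

fn cook(raw: &[RawToken]) -> Vec<CookedToken> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < raw.len() {
        match raw[i].kind {
            // An identifier directly following a literal is cooked into its suffix.
            RawKind::Int | RawKind::Str
                if raw.get(i + 1).map(|t| t.kind) == Some(RawKind::Ident) =>
            {
                out.push(CookedToken::Literal {
                    len: raw[i].len + raw[i + 1].len,
                    suffix_len: raw[i + 1].len,
                });
                i += 2;
            }
            RawKind::Int | RawKind::Str => {
                out.push(CookedToken::Literal { len: raw[i].len, suffix_len: 0 });
                i += 1;
            }
            RawKind::Ident => {
                out.push(CookedToken::Ident { len: raw[i].len });
                i += 1;
            }
            RawKind::Whitespace => {
                out.push(CookedToken::Whitespace { len: raw[i].len });
                i += 1;
            }
        }
    }
    out
}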

Having proc_macro::TokenStream stable is an interesting target/obstacle, because it provides an actually observable implementation of just the token layer of the parser, and it needs to stay stable to avoid breaking backwards compatibility (though, to be completely fair, there are plans to deprecate proc_macro in favor of meta eventually, right? That would be an opportunity to clean the token stream API up some, so long as the compat layer stays maintained.)

I'm going to make a rough draft of a specification matching this three-step process, as well as a canonical implementation to test against proc_macro. (Unfortunately, this means dealing with and accepting malformed character literals the same way rustc_lexer does, which is quite complicated. Hopefully I can find a fully regular way to describe the implemented behavior...)

Edit: whoops, already found an ICE! rust-lang/rust#70787

@Centril
Contributor

Centril commented Apr 4, 2020

Lumps all punctuation into one token kind. Honestly, this is just for shortening the regex.

I'd avoid that in the final spec as a matter of making the lexical specification more readable (and so that it gives references for the gluing of the lexical spec into the overall grammar).

| (?P<binary_int> 0b [01]+ )
| (?P<octal_int> 0o [0-7]+ )
| (?P<decimal_int> [0-9]+ )
| (?P<hexadecimal_int> 0x [0-9a-fA-F]+ )

Hmm, are underscores not handled here -- is this part of the cooking?

(I would write the specification function matchers in pseudocode, LaTeX algorithm/algorithmic style, but we all speak Rust here.)

Some denotational semantics perhaps ("function matchers") for the spec (using Rust for conversation is all fine though)? Using Rust as part of the meta-language for specification makes things a bit loopy, which is probably sub-optimal.

Also, doc comments seem to be interpreted as comments in your lexical spec?

@CAD97
Contributor

CAD97 commented Apr 4, 2020

@Centril

are underscores not handled here

That, honestly, was me overlooking them. Because e.g. 0b_0 is accepted, underscores should probably be integrated into these matchers.

doc comments seem to be interpreted as comments

This is how librustc_lexer currently does it, and a higher level is in charge of translating doc comments into attributes. We can pull doc comments down to the lexer level or leave them where they are currently; I think I'm loosely in favor of letting the level up from the "micro lexer" handle them.


I've been going all over the place, so here are some concrete questions for the other involved parties:

  • Is rustc running procedural macros with "malformed" token streams (i.e. it has already emitted an error)
    • Important to the spec, and the token stream needs to be properly specified for this case; or
    • A nonstandard extension in the name of better errors, and what token stream the proc macro is given thus has no bearing on the spec.
  • TokenStream::from_str is documented to either panic or return LexError on a bad input, and reserves the right to catch more panics and turn them into LexErrors at a later date. However, for some cases, it has a third option: succeed, but emit an error diagnostic. Is this:
    • A bug, and this should be panicking or returning LexError;
    • A bug, and at least some cases of tokens that panic in this way should be properly specified to succeed; or
    • A nonstandard extension in the name of better errors, and what token stream the proc macro is given thus has no bearing on the spec.

Having mulled on it a little bit more, I think that these cases where rustc is trying to continue after already emitting an error diagnostic should be considered as non-standard attempts to provide better error messages, and specified as errors.

cc @petrochenkov

@Centril
Contributor

Centril commented Apr 4, 2020

That, honestly, was me overlooking them.

:)

I think I'm loosely in favor of letting the level up from the "micro lexer" handle them.

This seems like the simpler choice to me in terms of avoiding logic duplication. In general, I'm in favor of the "more languages / IRs and compositions between them"-approach for better separation of concerns.

Is rustc running procedural macros with "malformed" token streams (i.e. it has already emitted an error)

I think we do try to recover, but I'm not sure if we do so on malformed token streams specifically. You may find rust-lang/rust#70074 relevant.

  • A nonstandard extension in the name of better errors, and what token stream the proc macro is given thus has no bearing on the spec.

Having mulled on it a little bit more, I think that these cases where rustc is trying to continue after already emitting an error diagnostic should be considered as non-standard attempts to provide better error messages, and specified as errors.

Recovering + erroring vs. aborting immediately on the first error can be thought of as the same thing from the POV of "is this a valid program?", so I think it's not "non-standard", but also not relevant to the spec. (In general, we can think of compilation as being under some error monad, e.g., Option.)

@CAD97
Contributor

CAD97 commented Apr 6, 2020

It's still very rough, but here's a draft of a "raw lexer" specification. Based on experimentation, it seems like the "tricky parts" that require cooking are:

  • suffixed literals, though that could be baked in if it were not for
  • floating point, which is hard and sensitive to (bounded) context
  • a lifetime immediately followed by another lifetime, which is currently a lexer error (it lexes as a suffixed single-quote string). Given how special-cased this one case is, I'm (very) slightly leaning towards recommending this be considered a bug in rustc, and that single-character strings that would also be valid as two lifetimes should be treated as two lifetimes in a raw token stream context (macro calls).

The stages I'm currently working with are

  1. The "raw lexer," which is a formally specified parser of UTF-8 strings to raw lexical classes.
  2. The "lexer cooker," which is a formally specified transformation from the raw lexer's output to a cooked token stream.
  3. The "cooked token stream" data structure, which roughly corresponds to proc_macro::TokenStream, but still keeps track of whitespace and comments, and does not yet match brackets.

@CAD97
Contributor

CAD97 commented Apr 21, 2020

The lexical specification draft now currently lives in https://github.com/CAD97/rust-lexical-spec.

I'd appreciate any more eyes on it while it's still a draft; I'll probably PR it to this repository once I've got a reference impl for the whole path and it tests against rustc_lexer.

@naturallymitchell

naturallymitchell commented Jun 6, 2020

What do you think of using pest (The Elegant Parser) and a parsing expression grammar (PEG)?
