Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistencies in floating point parsing #370

Closed
maurges opened this issue Feb 27, 2022 · 9 comments · Fixed by #371
Closed

Inconsistencies in floating point parsing #370

maurges opened this issue Feb 27, 2022 · 9 comments · Fixed by #371
Assignees

Comments

@maurges
Copy link

maurges commented Feb 27, 2022

Grammar taken from current master. There are a couple inconcistencies between the grammar, the current implementation, and from how rust does it.

  1. Exponential notation. The grammar allows 1.0e1, but disallows 1e1 and bizzarely disallows sign: 1.0e+1 and 1.0e-1. But the implementation accepts all those forms, and so does rust.

  2. Underscores are disallowed in floats, and this is consistent in both the grammar and the implementation; but it differs from rust and is a bit unexpected.

Here's small repro for everything:

type R<T> = ron::Result<T>;

fn main() {
    let x1: R<i64> = ron::from_str("1_0");
    println!("{:?}", &x1);
    // >>> Ok(10)

    let x2: R<f64> = ron::from_str("1_0.1_0");
    println!("{:?}", &x2);
    // >>> Err(Error { code: TrailingCharacters, position: Position { line: 1, col: 2 } })

    let x3: R<f64> = ron::from_str("1_0.10");
    println!("{:?}", &x3);
    // >>> Err(Error { code: TrailingCharacters, position: Position { line: 1, col: 2 } })

    let x4: R<f64> = ron::from_str("10.1_0");
    println!("{:?}", &x4);
    // >>> Err(Error { code: TrailingCharacters, position: Position { line: 1, col: 5 } })

    let x5: R<f64> = ron::from_str("1.0e-1");
    println!("{:?}", &x5);
    // >>> Ok(10.1)

    let y1: R<f64> = ron::from_str("1.0e1");
    println!("{:?}", &y1);
    // >>> Ok(10.0)

    let y2: R<f64> = ron::from_str("1e1");
    println!("{:?}", &y2);
    // >>> Ok(10.0)

    let y3: R<f64> = ron::from_str("1.0e1.0");
    println!("{:?}", &y3);
    // >>> Err(Error { code: ExpectedFloat, position: Position { line: 1, col: 1 } })

    let y4: R<f64> = ron::from_str("1.0e+1");
    println!("{:?}", &y4);
    // >>> Ok(10.0)

    let y5: R<f64> = ron::from_str("1.0e-1");
    println!("{:?}", &y5);
    // >>> Ok(0.1)
}
// y3 being Err is alright I believe, but the error message is weird
@juntyr
Copy link
Member

juntyr commented Feb 27, 2022

Thank you for the report @d86leader! I've done a bit of investigation to see why this is happening:

  1. You are correct about those, I think the grammar should be:
float = "inf" | "-inf" | "NaN" | float_num;
float_num = ["+" | "-"], (float_std | float_frac | float_int);
float_std = digit, { digit }, ".", {digit}, [float_exp];
float_frac = ".", digit, {digit}, [float_exp];
float_int = digit, { digit }, [float_exp];
float_exp = ("e" | "E"), ["+" | "-"], digit, {digit};
  1. The problem stems from how rustc parses floating literals vs how seemingly every library, including std with FromStr (which we use) parses them. While rustc allows underscores, nom, lexical, fast_float, and FromStr all do not. Furthermore, while FromStr allows .1e7 (the float_frac rule), rustc does not.

It seems like at the moment the best strategy would be to (a) fix the grammar and (b) add some tests to ensure that the grammar represents the implemented behaviour.

To improve the errors, we could also allow the parser to eat up underscores when checking for a float but then add a descriptive error message saying that we cannot parse them at the moment. Another possibility would of course be to parse and ignore them, which would allow us to parse every valid float literal in rustc but also accept a few ones that rustc would not allow. Finally, we could also implement the check for where underscores are allowed - after a digit, an underscore, or the (e|E)[+|-] in the exponent.

Eventually, it would of course be great if FromStr and rustc would agree and we could parse the same format.

@juntyr
Copy link
Member

juntyr commented Feb 27, 2022

Ok, here is the EBNF from FromStr:

Float  ::= Sign? ( 'inf' | 'NaN' | Number )
Number ::= ( Digit+ |
             Digit+ '.' Digit* |
             Digit* '.' Digit+ ) Exp?
Exp    ::= [eE] Sign? Digit+
Sign   ::= [+-]
Digit  ::= [0-9]

@juntyr
Copy link
Member

juntyr commented Feb 27, 2022

And here is rustc's EBNF (I've excluded the float suffix, see #241)

FLOAT_LITERAL ::=
     DEC_LITERAL .
   | DEC_LITERAL FLOAT_EXPONENT
   | DEC_LITERAL . DEC_LITERAL FLOAT_EXPONENT?

FLOAT_EXPONENT ::=
   (e|E) (+|-)? (DEC_DIGIT|_)* DEC_DIGIT (DEC_DIGIT|_)*

DEC_DIGIT ::= [0-9]

DEC_LITERAL ::= DEC_DIGIT (DEC_DIGIT|_)*

Rust doesn't support a leading + but does support a leading - which is parsed as a separate token.

@juntyr
Copy link
Member

juntyr commented Feb 27, 2022

Ok, my proposal would now be to (1) adapt FromStr's EBNF rules, implement the cases which are not yet correct, and add some unit tests. (2) Catch underscores in floats as a special error case but do not allow them (since no other parser I checked seems to).

@juntyr
Copy link
Member

juntyr commented Feb 27, 2022

So this would be the new EBNF:

float = ["+" | "-"], ("inf" | "NaN" | float_num);
float_num = (float_int | float_std | float_frac), [float_exp];
float_int = digit, { digit };
float_std = digit, { digit }, ".", {digit};
float_frac = ".", digit, {digit};
float_exp = ("e" | "E"), ["+" | "-"], digit, {digit};

@maurges
Copy link
Author

maurges commented Feb 27, 2022

Looking at this EBNF, I thought of another funny edge case: the ambiguity of NaN and inf between floats and keywords. The parser seems to resolve this ambiguity by seeing what the type itself requires:

type R<T> = ron::Result<T>;

#[derive(Debug, serde::Deserialize, serde::Serialize)]
struct NaN;

fn main() {
    let s = "NaN";

    let x1: R<NaN> = ron::from_str(s);
    let x2: R<f64> = ron::from_str(s);
    println!("{:?}\n{:?}", x1, x2);
    // >>> Ok(NaN)
    //     Ok(NaN)
}

@juntyr
Copy link
Member

juntyr commented Feb 27, 2022

Yes, that is correct. Since RON is not really self-describing, it will trust you in this case. Can we confuse it - yes we can:

#[derive(serde::Deserialize)]
struct NaN;

#[derive(serde::Deserialize)]
#[serde(untagged)]
enum Trolol {
    Lol(NaN),
    F64(f64),
}

#[test]
fn test() {
    assert!(matches!(ron::from_str("NaN"), Ok(Trolol::F64(f)) if f.is_nan()));
}

@juntyr
Copy link
Member

juntyr commented Feb 27, 2022

@d86leader Could you check if the PR would resolve your issues? Thanks!

@juntyr juntyr self-assigned this Feb 27, 2022
@maurges
Copy link
Author

maurges commented Feb 27, 2022

Thanks! The new grammar looks good and seems to reflect the implementation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants