Inconsistencies in floating point parsing #370

maurges · 2022-02-27T07:32:41Z

Grammar taken from current master. There are a couple inconcistencies between the grammar, the current implementation, and from how rust does it.

Exponential notation. The grammar allows 1.0e1, but disallows 1e1 and bizzarely disallows sign: 1.0e+1 and 1.0e-1. But the implementation accepts all those forms, and so does rust.
Underscores are disallowed in floats, and this is consistent in both the grammar and the implementation; but it differs from rust and is a bit unexpected.

Here's small repro for everything:

type R<T> = ron::Result<T>;

fn main() {
    let x1: R<i64> = ron::from_str("1_0");
    println!("{:?}", &x1);
    // >>> Ok(10)

    let x2: R<f64> = ron::from_str("1_0.1_0");
    println!("{:?}", &x2);
    // >>> Err(Error { code: TrailingCharacters, position: Position { line: 1, col: 2 } })

    let x3: R<f64> = ron::from_str("1_0.10");
    println!("{:?}", &x3);
    // >>> Err(Error { code: TrailingCharacters, position: Position { line: 1, col: 2 } })

    let x4: R<f64> = ron::from_str("10.1_0");
    println!("{:?}", &x4);
    // >>> Err(Error { code: TrailingCharacters, position: Position { line: 1, col: 5 } })

    let x5: R<f64> = ron::from_str("1.0e-1");
    println!("{:?}", &x5);
    // >>> Ok(10.1)

    let y1: R<f64> = ron::from_str("1.0e1");
    println!("{:?}", &y1);
    // >>> Ok(10.0)

    let y2: R<f64> = ron::from_str("1e1");
    println!("{:?}", &y2);
    // >>> Ok(10.0)

    let y3: R<f64> = ron::from_str("1.0e1.0");
    println!("{:?}", &y3);
    // >>> Err(Error { code: ExpectedFloat, position: Position { line: 1, col: 1 } })

    let y4: R<f64> = ron::from_str("1.0e+1");
    println!("{:?}", &y4);
    // >>> Ok(10.0)

    let y5: R<f64> = ron::from_str("1.0e-1");
    println!("{:?}", &y5);
    // >>> Ok(0.1)
}
// y3 being Err is alright I believe, but the error message is weird

The text was updated successfully, but these errors were encountered:

juntyr · 2022-02-27T10:13:41Z

Thank you for the report @d86leader! I've done a bit of investigation to see why this is happening:

You are correct about those, I think the grammar should be:

float = "inf" | "-inf" | "NaN" | float_num;
float_num = ["+" | "-"], (float_std | float_frac | float_int);
float_std = digit, { digit }, ".", {digit}, [float_exp];
float_frac = ".", digit, {digit}, [float_exp];
float_int = digit, { digit }, [float_exp];
float_exp = ("e" | "E"), ["+" | "-"], digit, {digit};

The problem stems from how rustc parses floating literals vs how seemingly every library, including std with FromStr (which we use) parses them. While rustc allows underscores, nom, lexical, fast_float, and FromStr all do not. Furthermore, while FromStr allows .1e7 (the float_frac rule), rustc does not.

It seems like at the moment the best strategy would be to (a) fix the grammar and (b) add some tests to ensure that the grammar represents the implemented behaviour.

To improve the errors, we could also allow the parser to eat up underscores when checking for a float but then add a descriptive error message saying that we cannot parse them at the moment. Another possibility would of course be to parse and ignore them, which would allow us to parse every valid float literal in rustc but also accept a few ones that rustc would not allow. Finally, we could also implement the check for where underscores are allowed - after a digit, an underscore, or the (e|E)[+|-] in the exponent.

Eventually, it would of course be great if FromStr and rustc would agree and we could parse the same format.

juntyr · 2022-02-27T10:29:14Z

Ok, here is the EBNF from FromStr:

Float  ::= Sign? ( 'inf' | 'NaN' | Number )
Number ::= ( Digit+ |
             Digit+ '.' Digit* |
             Digit* '.' Digit+ ) Exp?
Exp    ::= [eE] Sign? Digit+
Sign   ::= [+-]
Digit  ::= [0-9]

juntyr · 2022-02-27T10:38:31Z

And here is rustc's EBNF (I've excluded the float suffix, see #241)

FLOAT_LITERAL ::=
     DEC_LITERAL .
   | DEC_LITERAL FLOAT_EXPONENT
   | DEC_LITERAL . DEC_LITERAL FLOAT_EXPONENT?

FLOAT_EXPONENT ::=
   (e|E) (+|-)? (DEC_DIGIT|_)* DEC_DIGIT (DEC_DIGIT|_)*

DEC_DIGIT ::= [0-9]

DEC_LITERAL ::= DEC_DIGIT (DEC_DIGIT|_)*

Rust doesn't support a leading + but does support a leading - which is parsed as a separate token.

juntyr · 2022-02-27T10:49:01Z

Ok, my proposal would now be to (1) adapt FromStr's EBNF rules, implement the cases which are not yet correct, and add some unit tests. (2) Catch underscores in floats as a special error case but do not allow them (since no other parser I checked seems to).

juntyr · 2022-02-27T10:55:17Z

So this would be the new EBNF:

float = ["+" | "-"], ("inf" | "NaN" | float_num);
float_num = (float_int | float_std | float_frac), [float_exp];
float_int = digit, { digit };
float_std = digit, { digit }, ".", {digit};
float_frac = ".", digit, {digit};
float_exp = ("e" | "E"), ["+" | "-"], digit, {digit};

maurges · 2022-02-27T11:13:50Z

Looking at this EBNF, I thought of another funny edge case: the ambiguity of NaN and inf between floats and keywords. The parser seems to resolve this ambiguity by seeing what the type itself requires:

type R<T> = ron::Result<T>;

#[derive(Debug, serde::Deserialize, serde::Serialize)]
struct NaN;

fn main() {
    let s = "NaN";

    let x1: R<NaN> = ron::from_str(s);
    let x2: R<f64> = ron::from_str(s);
    println!("{:?}\n{:?}", x1, x2);
    // >>> Ok(NaN)
    //     Ok(NaN)
}

juntyr · 2022-02-27T11:22:16Z

Yes, that is correct. Since RON is not really self-describing, it will trust you in this case. Can we confuse it - yes we can:

#[derive(serde::Deserialize)]
struct NaN;

#[derive(serde::Deserialize)]
#[serde(untagged)]
enum Trolol {
    Lol(NaN),
    F64(f64),
}

#[test]
fn test() {
    assert!(matches!(ron::from_str("NaN"), Ok(Trolol::F64(f)) if f.is_nan()));
}

juntyr · 2022-02-27T11:32:04Z

@d86leader Could you check if the PR would resolve your issues? Thanks!

maurges · 2022-02-27T13:02:00Z

Thanks! The new grammar looks good and seems to reflect the implementation

juntyr added a commit to juntyr/ron that referenced this issue Feb 27, 2022

Fix ron-rs#370 with better float EBNF and +inf -NaN +NaN parsing

31a4b68

juntyr mentioned this issue Feb 27, 2022

Fix #370 with better float EBNF and +inf -NaN +NaN parsing #371

Merged

1 task

juntyr self-assigned this Feb 27, 2022

kvark closed this as completed in #371 Feb 28, 2022

kvark pushed a commit that referenced this issue Feb 28, 2022

Fix #370 with better float EBNF and +inf -NaN +NaN parsing

4d16313

torkleyy pushed a commit to torkleyy/ron that referenced this issue Jun 6, 2022

Fix ron-rs#370 with better float EBNF and +inf -NaN +NaN parsing

586eadd

torkleyy mentioned this issue Jun 6, 2022

Release v0.7.1 #382

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistencies in floating point parsing #370

Inconsistencies in floating point parsing #370

maurges commented Feb 27, 2022 •

edited

Loading

juntyr commented Feb 27, 2022 •

edited

Loading

juntyr commented Feb 27, 2022

juntyr commented Feb 27, 2022 •

edited

Loading

juntyr commented Feb 27, 2022

juntyr commented Feb 27, 2022

maurges commented Feb 27, 2022

juntyr commented Feb 27, 2022

juntyr commented Feb 27, 2022

maurges commented Feb 27, 2022

Inconsistencies in floating point parsing #370

Inconsistencies in floating point parsing #370

Comments

maurges commented Feb 27, 2022 • edited Loading

juntyr commented Feb 27, 2022 • edited Loading

juntyr commented Feb 27, 2022

juntyr commented Feb 27, 2022 • edited Loading

juntyr commented Feb 27, 2022

juntyr commented Feb 27, 2022

maurges commented Feb 27, 2022

juntyr commented Feb 27, 2022

juntyr commented Feb 27, 2022

maurges commented Feb 27, 2022

maurges commented Feb 27, 2022 •

edited

Loading

juntyr commented Feb 27, 2022 •

edited

Loading

juntyr commented Feb 27, 2022 •

edited

Loading