-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
is there a specification that Temporal uses to dictate how datetimes are parsed? #2843
Comments
Hi! Some of the champions may be able to answer in a bit more detail the how. From what I've come across during implementation, it is a bit of a hybrid but follows fairly closely to the specification. There's also some small nuances where this differs in 9557 that Temporal is more strict than the specification on the permanently registered keys (ixdtf readme for reference). The specification also supports extended year-month, month-day, and time strings, which isn't entirely introduced in 9557, but derived (so to speak) from the extension. EDIT: I believe the current plan is to use |
Ahhh okay, from unicode-org/icu4x#2127 I now see the ISO 8601 grammar that Temporal uses. I hadn't seen that before. That's exactly what I was looking for and I think precisely answers my question: Temporal uses its own hybrid format. I think it's worth while matching it given there is a precise specification. |
Yep! The ISO 8601 grammar set from the proposal was used to implement the parser over in |
Aye thanks. Other than my dependency philosophy, the main issue there is that it parses from |
@sffc do you know why this decision was made? RFC 9557 strings are guaranteed to be ASCII per the RFC 9557 ABNF.
As noted above, the grammar in the Temporal spec describes acceptable formats that can be parsed from string date/time inputs. These are (modulo a few corner cases) subsets of the syntax defined in RFC 9557. (An RFC that the Temporal champoins helped to author, so the dependency is not accidental!) Beyond the grammar, it may be helpful for you to understand the general principles that underlie how each Temporal type parses values: The general approach throughout Temporal is that each Temporal type requires only the subset of the RFC 9557 syntax that's needed to populate that type. Other components are ignored. For example, In a few cases, this simple subsetting strategy is augmented with a few semantically-important rules:
|
@justingrant Thank you! I gleaned most of that from the Temporal API docs. But there were a few particulars there that I hadn't quite internalized yet (like seconds being optional). I think the bit I was most curious about here were the bits that specifically aren't a subset of RFC 9557, RFC 3339, ISO 8601 or any other specification outside of Temporal. But I think the grammar given in the Temporal spec (which I had indeed missed before asking this question) clears that up for the most part. |
@justingrant There is one non-ascii character in the Temporal grammar specification for the |
My expectation is that it would parse &[u8] but also accept &str in order to support the FromStr trait. If the IXDTF crate doesn't already have a try_from_bytes function, we can add it before we release it. That said, I now have a recollection that the strings are not always ASCII because they allow ndash in place of hyphen for negative numbers. |
Ugh, yeah. Good catch. Although if there's some huge advantage in performance in excluding that, then you may want to file an issue in the Temporal repo to ask if that character could be excluded from the grammar. It would be a shame if some random unicode character that almost no one uses would cause a permanent performance loss and RAM tax. |
The only issue would be that I think |
@sffc Not U+2013 EN DASH, but U+2212 MINUS SIGN.
Strictly speaking yes, but in practice ECMA-262 uses it only in IsTimeZoneOffsetString and ParseTimeZoneOffsetString, which theirselves only receive input from SystemTimeZoneIdentifier so it could still be removed IMO. |
There's no system that might return U+2212 in an identifier? If not and it could be removed, then that would be great. It would essentially align the specification with RFC9557 as well as simplify parsing only having to handle ASCII. |
I can't remember why we decided to allow U+2212. I think it might have been because ISO 8601 allows it, but I'd need to check ISO 8601. |
It has been allowed since at least tc39/ecma262#2781 (comment) , but IIRC we couldn't find a system that definitively supplied an offset time zone identifier at all, let alone one that is not fully ASCII.
Yes, along with comma for the integer–fraction separator. |
I'd be inclined to remove support for U+2212. If it causes compatibility issues down the line it could always be added, but the benefit of guaranteeing an ASCII-only interchange format (and compatibility with RFC 9557) seems pretty large, compared to some theoretical compatibility with an obscure Unicode character. Would anyone on this thread object to removing support for this character in the Temporal grammar? |
Works for me. |
Following ISO-8601, tc39#2781 introduced U+2212 (Unicode minus) as an alias for the regular ASCII minus sign for use in time zone offsets. There's two new data that lead me to believe that this was a mistake, and that we should revert this change. The first is that the newly-released RFC 9557 (the string format standard that Temporal uses) disallows non-ASCII characters. Its predecessor RFC 3339 also disallows non-ASCII characters. So strings that follow the current (since 2022) ECMAScript spec could be rejected by RFC 9557 clients. The second new data is feedback from implementers of a Rust version of Temporal that this single obscure character in the grammar will incur a performance cost because they must now use Rust strings instead of plain U8 ASCII data. See tc39/proposal-temporal#2843 (comment) This performance issue doesn't seem to be limited to Rust. Any native implementation would likely benefit from being able to know that valid date/time input (both Date and Temporal) is always ASCII-only. I don't know whether all engines have actually implemented this 2022 grammar change. But it's also a safe bet that real-world usage of this Unicode character is likely minimal. So the web-compat risk seems small. If this PR is accepted, then we'll follow up with a normative Temporal PR to remove this character from Temporal as well.
Somewhat bad news: tc39/ecma262#2781 introduced U+2212 to ECMA-262 back in 2022. I think there's still a strong case for removing support for this character because it's not valid for the newly-released RFC 9557, but removing it will take more than a Temporal PR. I just filed tc39/ecma262#3334 to see if we can revert this character (added in 2022) from ECMAScript, and if that's approved then we can remove it from Temporal too in a later normative PR. |
Aaaaaand, here's the Temporal PR: #2856 |
Implementation-wise, removing the character would make things slightly easier, but I mean any parser needs to be able to support multi-byte tokens anyway since year/month/day are multi-byte; it's just that this token will be multi-byte with non-ASCII bytes (and in UTF-16 it doesn't need an extra byte). EDIT: To be more clear, given that it's not valid in RFC 9557, and since it will result in slightly lower complexity implementation code, I'm still mildly in favor of removing it from the grammar. |
Oh! What's an example of this? |
Here's the relevant part of ISO 8601:
(ISO/IEC 646 is, approximately, 7-bit ASCII.) It's a bit weird to think of JS as "an environment with a character repertoire based on 7-bit ASCII". |
I mean, |
Oh, I thought you were saying that year/month/day could have multi-byte characters. Sorry for misunderstanding. |
Following ISO-8601, tc39#2781 introduced U+2212 (Unicode minus) as an alias for the regular ASCII minus sign for use in time zone offsets. There's two new data that lead me to believe that this was a mistake, and that we should revert this change. The first is that the newly-released RFC 9557 (the string format standard that Temporal uses) disallows non-ASCII characters. Its predecessor RFC 3339 also disallows non-ASCII characters. So strings that follow the current (since 2022) ECMAScript spec could be rejected by RFC 9557 clients. The second new data is feedback from implementers of a Rust version of Temporal that this single obscure character in the grammar will incur a performance cost because they must now use Rust strings instead of plain U8 ASCII data. See tc39/proposal-temporal#2843 (comment) This performance issue doesn't seem to be limited to Rust. Any native implementation would likely benefit from being able to know that valid date/time input (both Date and Temporal) is always ASCII-only. I don't know whether all engines have actually implemented this 2022 grammar change. But it's also a safe bet that real-world usage of this Unicode character is likely minimal. So the web-compat risk seems small. If this PR is accepted, then we'll follow up with a normative Temporal PR to remove this character from Temporal as well.
Following ISO-8601, tc39#2781 introduced U+2212 (Unicode minus) as an alias for the regular ASCII minus sign for use in time zone offsets. There's two new data that lead me to believe that this was a mistake, and that we should revert this change. The first is that the newly-released RFC 9557 (the string format standard that Temporal uses) disallows non-ASCII characters. Its predecessor RFC 3339 also disallows non-ASCII characters. So strings that follow the current (since 2022) ECMAScript spec could be rejected by RFC 9557 clients. The second new data is feedback from implementers of a Rust version of Temporal that this single obscure character in the grammar will incur a performance cost because they must now use Rust strings instead of plain U8 ASCII data. See tc39/proposal-temporal#2843 (comment) This performance issue doesn't seem to be limited to Rust. Any native implementation would likely benefit from being able to know that valid date/time input (both Date and Temporal) is always ASCII-only. I don't know whether all engines have actually implemented this 2022 grammar change. But it's also a safe bet that real-world usage of this Unicode character is likely minimal. So the web-compat risk seems small. If this PR is accepted, then we'll follow up with a normative Temporal PR to remove this character from Temporal as well.
I am working on a Temporal-inspired datetime library for Rust, and I'm hoping it's appropriate to ask questions here. To set the tone explicitly, this is a "for my own understanding" question and not a "why aren't you doing something different" question.
I'm working on the parsing/printing aspect of my library at the moment, and I'm trying to figure out what kind of semantics I want to offer. I'm looking to Temporal as a guide here, but also the relevant standards (RFCs 3339 and 9557, along with ISO 8601). I came across this case:
which has this output:
My question is: is the format parsed here one of Temporal's own devising, or is there some other relevant standard or relevant precedent informing this behavior? More specifically, as I understand it, the
[America/New_York]
syntax is an extension of RFC 3339 by RFC 9557. But the perplexing part to me here is that the offset in the datetime string is missing. Yet, RFC 3339 specifically requires an offset (orZ
). So it would seem the datetime string format here is invalid from the perspective of RFCs 3339 and 9557. Yet, Temporal supports it. I guess my question is: how did y'all arrive at this format?And I ask because this looks like something one ought to support. So I want to make sure I implement it correctly. Also, I have read the docs on string parsing and formatting, but I don't think it gives a crisp answer to my question here. If I were to guess, I would say that Temporal has its own hybrid format with some of ISO 8601 (which permits
2024-03-01T01:30:00
on its own without an offset) and some of RFC 3339/9557.The text was updated successfully, but these errors were encountered: