Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

is there a specification that Temporal uses to dictate how datetimes are parsed? #2843

Closed
BurntSushi opened this issue May 17, 2024 · 23 comments
Labels

Comments

@BurntSushi
Copy link

BurntSushi commented May 17, 2024

I am working on a Temporal-inspired datetime library for Rust, and I'm hoping it's appropriate to ask questions here. To set the tone explicitly, this is a "for my own understanding" question and not a "why aren't you doing something different" question.

I'm working on the parsing/printing aspect of my library at the moment, and I'm trying to figure out what kind of semantics I want to offer. I'm looking to Temporal as a guide here, but also the relevant standards (RFCs 3339 and 9557, along with ISO 8601). I came across this case:

let dt = Temporal.ZonedDateTime.from("2024-03-01T01:30:00[America/New_York]");
console.log(dt.toString());

which has this output:

2024-03-01T01:30:00-05:00[America/New_York]

My question is: is the format parsed here one of Temporal's own devising, or is there some other relevant standard or relevant precedent informing this behavior? More specifically, as I understand it, the [America/New_York] syntax is an extension of RFC 3339 by RFC 9557. But the perplexing part to me here is that the offset in the datetime string is missing. Yet, RFC 3339 specifically requires an offset (or Z). So it would seem the datetime string format here is invalid from the perspective of RFCs 3339 and 9557. Yet, Temporal supports it. I guess my question is: how did y'all arrive at this format?

And I ask because this looks like something one ought to support. So I want to make sure I implement it correctly. Also, I have read the docs on string parsing and formatting, but I don't think it gives a crisp answer to my question here. If I were to guess, I would say that Temporal has its own hybrid format with some of ISO 8601 (which permits 2024-03-01T01:30:00 on its own without an offset) and some of RFC 3339/9557.

@nekevss
Copy link

nekevss commented May 18, 2024

Hi! Some of the champions may be able to answer in a bit more detail the how. From what I've come across during implementation, it is a bit of a hybrid but follows fairly closely to the specification. There's also some small nuances where this differs in 9557 that Temporal is more strict than the specification on the permanently registered keys (ixdtf readme for reference).

The specification also supports extended year-month, month-day, and time strings, which isn't entirely introduced in 9557, but derived (so to speak) from the extension.

EDIT: I believe the current plan is to use ixdtf in icu_calendar and temporal_rs if you're at all interested in using that crate once it's released.

@BurntSushi
Copy link
Author

BurntSushi commented May 18, 2024

Ahhh okay, from unicode-org/icu4x#2127 I now see the ISO 8601 grammar that Temporal uses. I hadn't seen that before. That's exactly what I was looking for and I think precisely answers my question: Temporal uses its own hybrid format. I think it's worth while matching it given there is a precise specification.

@nekevss
Copy link

nekevss commented May 18, 2024

Ahhh okay, from unicode-org/icu4x#2127 I now see the ISO 8601 grammar that Temporal uses

Yep! The ISO 8601 grammar set from the proposal was used to implement the parser over in temporal_rs. That parser was the initial base for the current implementation of ixdtf in icu4x. So the baseline support should be there for temporal if you want another implementation example 😄

@BurntSushi
Copy link
Author

Aye thanks. Other than my dependency philosophy, the main issue there is that it parses from &str instead of &[u8]. So every parse will necessarily require UTF-8 validation. And that puts a ceiling on perf.

@justingrant
Copy link
Collaborator

it parses from &str instead of &[u8]

@sffc do you know why this decision was made? RFC 9557 strings are guaranteed to be ASCII per the RFC 9557 ABNF.

Temporal uses its own hybrid format. I think it's worthwhile matching it given there is a precise specification.

As noted above, the grammar in the Temporal spec describes acceptable formats that can be parsed from string date/time inputs. These are (modulo a few corner cases) subsets of the syntax defined in RFC 9557. (An RFC that the Temporal champoins helped to author, so the dependency is not accidental!)

Beyond the grammar, it may be helpful for you to understand the general principles that underlie how each Temporal type parses values:

The general approach throughout Temporal is that each Temporal type requires only the subset of the RFC 9557 syntax that's needed to populate that type. Other components are ignored. For example, Temporal.PlainDate.from accepts "2024-05-19" but also "2024-05-19T13:47" and "2024-05-19T13:47[u-ca=chinese]" and "2024-05-19T13:47-07:00" and "2024-05-19T13:47-07:00[America/San_Francisco]" and "2024-05-19T13:47-07:00[America/San_Francisco][u-ca=chinese]". And probably a few other combinations, as long as the core year/month/day values are present in the string.

In a few cases, this simple subsetting strategy is augmented with a few semantically-important rules:

  • Per RFC 9557, the Z offset means "I know the instant but not the local time nor the local offset", so you can't parse a string using a Z offset with any of the Plain* types, because those types are, by definition, focused on local time not epoch-relative timestamps.
  • Time values can be omitted and are defaulted to 00:00. For example, Temporal.PlainDateTime.from('2024-05-19') is equivalent to Temporal.PlainDateTime.from('2024-05-19T00:00').
  • When parsing a Temporal.ZonedDateTime value, if the offset is omitted or if the offset is Z, then the offset is taken from the time zone in the string. So Temporal.ZonedDateTime.from("2024-03-01T01:30:00[America/New_York]") and Temporal.ZonedDateTime.from("024-03-01T06:30:00Z[America/New_York]") are equivalent to Temporal.ZonedDateTime.from("2024-03-01T01:30:00-05:00[America/New_York]"). This is equivalent to the behavior of the offset: 'ignore' option.
  • Seconds and smaller values can be omitted from time values.
  • IIRC, there's some oddness around parsing of Temporal.PlainMonthDay values to avoid ambiguous values, but I don't remember those details at the moment.

@BurntSushi
Copy link
Author

BurntSushi commented May 20, 2024

@justingrant Thank you! I gleaned most of that from the Temporal API docs. But there were a few particulars there that I hadn't quite internalized yet (like seconds being optional). I think the bit I was most curious about here were the bits that specifically aren't a subset of RFC 9557, RFC 3339, ISO 8601 or any other specification outside of Temporal. But I think the grammar given in the Temporal spec (which I had indeed missed before asking this question) clears that up for the most part.

@nekevss
Copy link

nekevss commented May 20, 2024

@justingrant There is one non-ascii character in the Temporal grammar specification for the TemporalSign (U+2212) that has to be handled to support the Temporal version vs. strict to the RFC9557 spec.

@sffc
Copy link
Collaborator

sffc commented May 20, 2024

it parses from &str instead of &[u8]

@sffc do you know why this decision was made? RFC 9557 strings are guaranteed to be ASCII per the RFC 9557 ABNF.

My expectation is that it would parse &[u8] but also accept &str in order to support the FromStr trait. If the IXDTF crate doesn't already have a try_from_bytes function, we can add it before we release it.

That said, I now have a recollection that the strings are not always ASCII because they allow ndash in place of hyphen for negative numbers.

@justingrant
Copy link
Collaborator

@justingrant There is one non-ascii character in the Temporal grammar specification for the TemporalSign (U+2212) that has to be handled to support the Temporal version vs. strict to the RFC9557 spec.

Ugh, yeah. Good catch. Although if there's some huge advantage in performance in excluding that, then you may want to file an issue in the Temporal repo to ask if that character could be excluded from the grammar. It would be a shame if some random unicode character that almost no one uses would cause a permanent performance loss and RAM tax.

@nekevss
Copy link

nekevss commented May 20, 2024

The only issue would be that I think TemporalSign is in the current ECMAScript specification. Unless that specification is the upcoming one. Looks like it may be needed for compatibility with Date's TimeZone offset string format.

@gibson042
Copy link
Collaborator

That said, I now have a recollection that the strings are not always ASCII because they allow ndash in place of hyphen for negative numbers.

@sffc Not U+2013 EN DASH, but U+2212 MINUS SIGN.

The only issue would be that I think TemporalSign is in the current ECMAScript specification. Unless that specification is the upcoming one. Looks like it may be needed for compatibility with Date's TimeZone offset string format.

Strictly speaking yes, but in practice ECMA-262 uses it only in IsTimeZoneOffsetString and ParseTimeZoneOffsetString, which theirselves only receive input from SystemTimeZoneIdentifier so it could still be removed IMO.

@nekevss
Copy link

nekevss commented May 20, 2024

Strictly speaking yes, but in practice ECMA-262 uses it only in IsTimeZoneOffsetString and ParseTimeZoneOffsetString, which theirselves only receive input from SystemTimeZoneIdentifier so it could still be removed IMO

There's no system that might return U+2212 in an identifier? If not and it could be removed, then that would be great. It would essentially align the specification with RFC9557 as well as simplify parsing only having to handle ASCII.

@ptomato
Copy link
Collaborator

ptomato commented May 20, 2024

I can't remember why we decided to allow U+2212. I think it might have been because ISO 8601 allows it, but I'd need to check ISO 8601.

@gibson042
Copy link
Collaborator

There's no system that might return U+2212 in an identifier?

It has been allowed since at least tc39/ecma262#2781 (comment) , but IIRC we couldn't find a system that definitively supplied an offset time zone identifier at all, let alone one that is not fully ASCII.

I can't remember why we decided to allow U+2212. I think it might have been because ISO 8601 allows it, but I'd need to check ISO 8601.

Yes, along with comma for the integer–fraction separator.

@justingrant
Copy link
Collaborator

I'd be inclined to remove support for U+2212. If it causes compatibility issues down the line it could always be added, but the benefit of guaranteeing an ASCII-only interchange format (and compatibility with RFC 9557) seems pretty large, compared to some theoretical compatibility with an obscure Unicode character.

Would anyone on this thread object to removing support for this character in the Temporal grammar?

@gibson042
Copy link
Collaborator

Works for me.

justingrant added a commit to justingrant/ecma262 that referenced this issue May 23, 2024
Following ISO-8601, tc39#2781 introduced U+2212 (Unicode minus) as an alias
for the regular ASCII minus sign for use in time zone offsets.

There's two new data that lead me to believe that this was a mistake,
and that we should revert this change.

The first is that the newly-released RFC 9557 (the string format
standard that Temporal uses) disallows non-ASCII characters. Its
predecessor RFC 3339 also disallows non-ASCII characters. So
strings that follow the current (since 2022) ECMAScript spec
could be rejected by RFC 9557 clients.

The second new data is feedback from implementers of a Rust version of
Temporal that this single obscure character in the grammar will incur a
performance cost because they must now use Rust strings instead
of plain U8 ASCII data. See
tc39/proposal-temporal#2843 (comment)

This performance issue doesn't seem to be limited to Rust. Any
native implementation would likely benefit from being able to know that
valid date/time input (both Date and Temporal) is always ASCII-only.

I don't know whether all engines have actually implemented this 2022
grammar change. But it's also a safe bet that real-world usage of this
Unicode character is likely minimal. So the web-compat risk seems small.

If this PR is accepted, then we'll follow up with a normative Temporal
PR to remove this character from Temporal as well.
@justingrant
Copy link
Collaborator

Somewhat bad news: tc39/ecma262#2781 introduced U+2212 to ECMA-262 back in 2022. I think there's still a strong case for removing support for this character because it's not valid for the newly-released RFC 9557, but removing it will take more than a Temporal PR.

I just filed tc39/ecma262#3334 to see if we can revert this character (added in 2022) from ECMAScript, and if that's approved then we can remove it from Temporal too in a later normative PR.

@justingrant
Copy link
Collaborator

Aaaaaand, here's the Temporal PR: #2856

@sffc
Copy link
Collaborator

sffc commented May 23, 2024

Implementation-wise, removing the character would make things slightly easier, but I mean any parser needs to be able to support multi-byte tokens anyway since year/month/day are multi-byte; it's just that this token will be multi-byte with non-ASCII bytes (and in UTF-16 it doesn't need an extra byte).

EDIT: To be more clear, given that it's not valid in RFC 9557, and since it will result in slightly lower complexity implementation code, I'm still mildly in favor of removing it from the grammar.

@justingrant
Copy link
Collaborator

year/month/day are multi-byte

Oh! What's an example of this?

@ptomato
Copy link
Collaborator

ptomato commented May 23, 2024

Here's the relevant part of ISO 8601:

In an environment where use is made of a character repertoire based on ISO/IEC 646, “hyphen” and “minus” should be both mapped onto “hyphen-minus”.

(ISO/IEC 646 is, approximately, 7-bit ASCII.)

It's a bit weird to think of JS as "an environment with a character repertoire based on 7-bit ASCII".

@sffc
Copy link
Collaborator

sffc commented May 24, 2024

year/month/day are multi-byte

Oh! What's an example of this?

I mean, 2024 is 4 code units. A tokenizer needs to operate on ranges, not on single code units

@justingrant
Copy link
Collaborator

Oh, I thought you were saying that year/month/day could have multi-byte characters. Sorry for misunderstanding.

justingrant added a commit to justingrant/ecma262 that referenced this issue Jul 2, 2024
Following ISO-8601, tc39#2781 introduced U+2212 (Unicode minus) as an alias
for the regular ASCII minus sign for use in time zone offsets.

There's two new data that lead me to believe that this was a mistake,
and that we should revert this change.

The first is that the newly-released RFC 9557 (the string format
standard that Temporal uses) disallows non-ASCII characters. Its
predecessor RFC 3339 also disallows non-ASCII characters. So
strings that follow the current (since 2022) ECMAScript spec
could be rejected by RFC 9557 clients.

The second new data is feedback from implementers of a Rust version of
Temporal that this single obscure character in the grammar will incur a
performance cost because they must now use Rust strings instead
of plain U8 ASCII data. See
tc39/proposal-temporal#2843 (comment)

This performance issue doesn't seem to be limited to Rust. Any
native implementation would likely benefit from being able to know that
valid date/time input (both Date and Temporal) is always ASCII-only.

I don't know whether all engines have actually implemented this 2022
grammar change. But it's also a safe bet that real-world usage of this
Unicode character is likely minimal. So the web-compat risk seems small.

If this PR is accepted, then we'll follow up with a normative Temporal
PR to remove this character from Temporal as well.
ljharb pushed a commit to justingrant/ecma262 that referenced this issue Jul 17, 2024
Following ISO-8601, tc39#2781 introduced U+2212 (Unicode minus) as an alias
for the regular ASCII minus sign for use in time zone offsets.

There's two new data that lead me to believe that this was a mistake,
and that we should revert this change.

The first is that the newly-released RFC 9557 (the string format
standard that Temporal uses) disallows non-ASCII characters. Its
predecessor RFC 3339 also disallows non-ASCII characters. So
strings that follow the current (since 2022) ECMAScript spec
could be rejected by RFC 9557 clients.

The second new data is feedback from implementers of a Rust version of
Temporal that this single obscure character in the grammar will incur a
performance cost because they must now use Rust strings instead
of plain U8 ASCII data. See
tc39/proposal-temporal#2843 (comment)

This performance issue doesn't seem to be limited to Rust. Any
native implementation would likely benefit from being able to know that
valid date/time input (both Date and Temporal) is always ASCII-only.

I don't know whether all engines have actually implemented this 2022
grammar change. But it's also a safe bet that real-world usage of this
Unicode character is likely minimal. So the web-compat risk seems small.

If this PR is accepted, then we'll follow up with a normative Temporal
PR to remove this character from Temporal as well.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants