is there a specification that Temporal uses to dictate how datetimes are parsed? #2843

BurntSushi · 2024-05-17T23:43:13Z

I am working on a Temporal-inspired datetime library for Rust, and I'm hoping it's appropriate to ask questions here. To set the tone explicitly, this is a "for my own understanding" question and not a "why aren't you doing something different" question.

I'm working on the parsing/printing aspect of my library at the moment, and I'm trying to figure out what kind of semantics I want to offer. I'm looking to Temporal as a guide here, but also the relevant standards (RFCs 3339 and 9557, along with ISO 8601). I came across this case:

let dt = Temporal.ZonedDateTime.from("2024-03-01T01:30:00[America/New_York]");
console.log(dt.toString());

which has this output:

2024-03-01T01:30:00-05:00[America/New_York]

My question is: is the format parsed here one of Temporal's own devising, or is there some other relevant standard or relevant precedent informing this behavior? More specifically, as I understand it, the [America/New_York] syntax is an extension of RFC 3339 by RFC 9557. But the perplexing part to me here is that the offset in the datetime string is missing. Yet, RFC 3339 specifically requires an offset (or Z). So it would seem the datetime string format here is invalid from the perspective of RFCs 3339 and 9557. Yet, Temporal supports it. I guess my question is: how did y'all arrive at this format?

And I ask because this looks like something one ought to support. So I want to make sure I implement it correctly. Also, I have read the docs on string parsing and formatting, but I don't think it gives a crisp answer to my question here. If I were to guess, I would say that Temporal has its own hybrid format with some of ISO 8601 (which permits 2024-03-01T01:30:00 on its own without an offset) and some of RFC 3339/9557.

The text was updated successfully, but these errors were encountered:

nekevss · 2024-05-18T01:40:39Z

Hi! Some of the champions may be able to answer in a bit more detail the how. From what I've come across during implementation, it is a bit of a hybrid but follows fairly closely to the specification. There's also some small nuances where this differs in 9557 that Temporal is more strict than the specification on the permanently registered keys (ixdtf readme for reference).

The specification also supports extended year-month, month-day, and time strings, which isn't entirely introduced in 9557, but derived (so to speak) from the extension.

EDIT: I believe the current plan is to use ixdtf in icu_calendar and temporal_rs if you're at all interested in using that crate once it's released.

BurntSushi · 2024-05-18T13:20:54Z

Ahhh okay, from unicode-org/icu4x#2127 I now see the ISO 8601 grammar that Temporal uses. I hadn't seen that before. That's exactly what I was looking for and I think precisely answers my question: Temporal uses its own hybrid format. I think it's worth while matching it given there is a precise specification.

nekevss · 2024-05-18T15:27:41Z

Ahhh okay, from unicode-org/icu4x#2127 I now see the ISO 8601 grammar that Temporal uses

Yep! The ISO 8601 grammar set from the proposal was used to implement the parser over in temporal_rs. That parser was the initial base for the current implementation of ixdtf in icu4x. So the baseline support should be there for temporal if you want another implementation example 😄

BurntSushi · 2024-05-18T16:14:22Z

Aye thanks. Other than my dependency philosophy, the main issue there is that it parses from &str instead of &[u8]. So every parse will necessarily require UTF-8 validation. And that puts a ceiling on perf.

justingrant · 2024-05-20T00:11:22Z

it parses from &str instead of &[u8]

@sffc do you know why this decision was made? RFC 9557 strings are guaranteed to be ASCII per the RFC 9557 ABNF.

Temporal uses its own hybrid format. I think it's worthwhile matching it given there is a precise specification.

As noted above, the grammar in the Temporal spec describes acceptable formats that can be parsed from string date/time inputs. These are (modulo a few corner cases) subsets of the syntax defined in RFC 9557. (An RFC that the Temporal champoins helped to author, so the dependency is not accidental!)

Beyond the grammar, it may be helpful for you to understand the general principles that underlie how each Temporal type parses values:

The general approach throughout Temporal is that each Temporal type requires only the subset of the RFC 9557 syntax that's needed to populate that type. Other components are ignored. For example, Temporal.PlainDate.from accepts "2024-05-19" but also "2024-05-19T13:47" and "2024-05-19T13:47[u-ca=chinese]" and "2024-05-19T13:47-07:00" and "2024-05-19T13:47-07:00[America/San_Francisco]" and "2024-05-19T13:47-07:00[America/San_Francisco][u-ca=chinese]". And probably a few other combinations, as long as the core year/month/day values are present in the string.

In a few cases, this simple subsetting strategy is augmented with a few semantically-important rules:

Per RFC 9557, the Z offset means "I know the instant but not the local time nor the local offset", so you can't parse a string using a Z offset with any of the Plain* types, because those types are, by definition, focused on local time not epoch-relative timestamps.
Time values can be omitted and are defaulted to 00:00. For example, Temporal.PlainDateTime.from('2024-05-19') is equivalent to Temporal.PlainDateTime.from('2024-05-19T00:00').
When parsing a Temporal.ZonedDateTime value, if the offset is omitted or if the offset is Z, then the offset is taken from the time zone in the string. So Temporal.ZonedDateTime.from("2024-03-01T01:30:00[America/New_York]") and Temporal.ZonedDateTime.from("024-03-01T06:30:00Z[America/New_York]") are equivalent to Temporal.ZonedDateTime.from("2024-03-01T01:30:00-05:00[America/New_York]"). This is equivalent to the behavior of the offset: 'ignore' option.
Seconds and smaller values can be omitted from time values.
IIRC, there's some oddness around parsing of Temporal.PlainMonthDay values to avoid ambiguous values, but I don't remember those details at the moment.

BurntSushi · 2024-05-20T00:18:54Z

@justingrant Thank you! I gleaned most of that from the Temporal API docs. But there were a few particulars there that I hadn't quite internalized yet (like seconds being optional). I think the bit I was most curious about here were the bits that specifically aren't a subset of RFC 9557, RFC 3339, ISO 8601 or any other specification outside of Temporal. But I think the grammar given in the Temporal spec (which I had indeed missed before asking this question) clears that up for the most part.

nekevss · 2024-05-20T00:27:46Z

@justingrant There is one non-ascii character in the Temporal grammar specification for the TemporalSign (U+2212) that has to be handled to support the Temporal version vs. strict to the RFC9557 spec.

sffc · 2024-05-20T05:59:10Z

it parses from &str instead of &[u8]

@sffc do you know why this decision was made? RFC 9557 strings are guaranteed to be ASCII per the RFC 9557 ABNF.

My expectation is that it would parse &[u8] but also accept &str in order to support the FromStr trait. If the IXDTF crate doesn't already have a try_from_bytes function, we can add it before we release it.

That said, I now have a recollection that the strings are not always ASCII because they allow ndash in place of hyphen for negative numbers.

justingrant · 2024-05-20T06:59:54Z

@justingrant There is one non-ascii character in the Temporal grammar specification for the TemporalSign (U+2212) that has to be handled to support the Temporal version vs. strict to the RFC9557 spec.

Ugh, yeah. Good catch. Although if there's some huge advantage in performance in excluding that, then you may want to file an issue in the Temporal repo to ask if that character could be excluded from the grammar. It would be a shame if some random unicode character that almost no one uses would cause a permanent performance loss and RAM tax.

nekevss · 2024-05-20T12:22:22Z

The only issue would be that I think TemporalSign is in the current ECMAScript specification. Unless that specification is the upcoming one. Looks like it may be needed for compatibility with Date's TimeZone offset string format.

gibson042 · 2024-05-20T16:59:09Z

That said, I now have a recollection that the strings are not always ASCII because they allow ndash in place of hyphen for negative numbers.

@sffc Not U+2013 EN DASH, but U+2212 MINUS SIGN.

The only issue would be that I think TemporalSign is in the current ECMAScript specification. Unless that specification is the upcoming one. Looks like it may be needed for compatibility with Date's TimeZone offset string format.

Strictly speaking yes, but in practice ECMA-262 uses it only in IsTimeZoneOffsetString and ParseTimeZoneOffsetString, which theirselves only receive input from SystemTimeZoneIdentifier so it could still be removed IMO.

nekevss · 2024-05-20T17:25:37Z

Strictly speaking yes, but in practice ECMA-262 uses it only in IsTimeZoneOffsetString and ParseTimeZoneOffsetString, which theirselves only receive input from SystemTimeZoneIdentifier so it could still be removed IMO

There's no system that might return U+2212 in an identifier? If not and it could be removed, then that would be great. It would essentially align the specification with RFC9557 as well as simplify parsing only having to handle ASCII.

ptomato · 2024-05-20T17:32:24Z

I can't remember why we decided to allow U+2212. I think it might have been because ISO 8601 allows it, but I'd need to check ISO 8601.

gibson042 · 2024-05-20T19:25:12Z

There's no system that might return U+2212 in an identifier?

It has been allowed since at least tc39/ecma262#2781 (comment) , but IIRC we couldn't find a system that definitively supplied an offset time zone identifier at all, let alone one that is not fully ASCII.

I can't remember why we decided to allow U+2212. I think it might have been because ISO 8601 allows it, but I'd need to check ISO 8601.

Yes, along with comma for the integer–fraction separator.

justingrant · 2024-05-21T06:15:19Z

I'd be inclined to remove support for U+2212. If it causes compatibility issues down the line it could always be added, but the benefit of guaranteeing an ASCII-only interchange format (and compatibility with RFC 9557) seems pretty large, compared to some theoretical compatibility with an obscure Unicode character.

Would anyone on this thread object to removing support for this character in the Temporal grammar?

gibson042 · 2024-05-21T16:10:26Z

Works for me.

Following ISO-8601, tc39#2781 introduced U+2212 (Unicode minus) as an alias for the regular ASCII minus sign for use in time zone offsets. There's two new data that lead me to believe that this was a mistake, and that we should revert this change. The first is that the newly-released RFC 9557 (the string format standard that Temporal uses) disallows non-ASCII characters. Its predecessor RFC 3339 also disallows non-ASCII characters. So strings that follow the current (since 2022) ECMAScript spec could be rejected by RFC 9557 clients. The second new data is feedback from implementers of a Rust version of Temporal that this single obscure character in the grammar will incur a performance cost because they must now use Rust strings instead of plain U8 ASCII data. See tc39/proposal-temporal#2843 (comment) This performance issue doesn't seem to be limited to Rust. Any native implementation would likely benefit from being able to know that valid date/time input (both Date and Temporal) is always ASCII-only. I don't know whether all engines have actually implemented this 2022 grammar change. But it's also a safe bet that real-world usage of this Unicode character is likely minimal. So the web-compat risk seems small. If this PR is accepted, then we'll follow up with a normative Temporal PR to remove this character from Temporal as well.

justingrant · 2024-05-23T00:52:43Z

Somewhat bad news: tc39/ecma262#2781 introduced U+2212 to ECMA-262 back in 2022. I think there's still a strong case for removing support for this character because it's not valid for the newly-released RFC 9557, but removing it will take more than a Temporal PR.

I just filed tc39/ecma262#3334 to see if we can revert this character (added in 2022) from ECMAScript, and if that's approved then we can remove it from Temporal too in a later normative PR.

justingrant · 2024-05-23T01:14:08Z

Aaaaaand, here's the Temporal PR: #2856

sffc · 2024-05-23T01:51:40Z

Implementation-wise, removing the character would make things slightly easier, but I mean any parser needs to be able to support multi-byte tokens anyway since year/month/day are multi-byte; it's just that this token will be multi-byte with non-ASCII bytes (and in UTF-16 it doesn't need an extra byte).

EDIT: To be more clear, given that it's not valid in RFC 9557, and since it will result in slightly lower complexity implementation code, I'm still mildly in favor of removing it from the grammar.

justingrant · 2024-05-23T03:48:45Z

year/month/day are multi-byte

Oh! What's an example of this?

ptomato · 2024-05-23T17:09:28Z

Here's the relevant part of ISO 8601:

In an environment where use is made of a character repertoire based on ISO/IEC 646, “hyphen” and “minus” should be both mapped onto “hyphen-minus”.

(ISO/IEC 646 is, approximately, 7-bit ASCII.)

It's a bit weird to think of JS as "an environment with a character repertoire based on 7-bit ASCII".

sffc · 2024-05-24T07:48:51Z

year/month/day are multi-byte

Oh! What's an example of this?

I mean, 2024 is 4 code units. A tokenizer needs to operate on ranges, not on single code units

justingrant · 2024-05-24T15:51:29Z

Oh, I thought you were saying that year/month/day could have multi-byte characters. Sorry for misunderstanding.

Following ISO-8601, tc39#2781 introduced U+2212 (Unicode minus) as an alias for the regular ASCII minus sign for use in time zone offsets. There's two new data that lead me to believe that this was a mistake, and that we should revert this change. The first is that the newly-released RFC 9557 (the string format standard that Temporal uses) disallows non-ASCII characters. Its predecessor RFC 3339 also disallows non-ASCII characters. So strings that follow the current (since 2022) ECMAScript spec could be rejected by RFC 9557 clients. The second new data is feedback from implementers of a Rust version of Temporal that this single obscure character in the grammar will incur a performance cost because they must now use Rust strings instead of plain U8 ASCII data. See tc39/proposal-temporal#2843 (comment) This performance issue doesn't seem to be limited to Rust. Any native implementation would likely benefit from being able to know that valid date/time input (both Date and Temporal) is always ASCII-only. I don't know whether all engines have actually implemented this 2022 grammar change. But it's also a safe bet that real-world usage of this Unicode character is likely minimal. So the web-compat risk seems small. If this PR is accepted, then we'll follow up with a normative Temporal PR to remove this character from Temporal as well.

nekevss mentioned this issue May 20, 2024

Migrate from &str to &[u8] in ixdtf unicode-org/icu4x#4918

Merged

justingrant added the meeting-agenda label May 21, 2024

justingrant mentioned this issue May 23, 2024

Normative: Revert U+2212 (Unicode minus sign) as a valid character in timezone offsets tc39/ecma262#3334

Merged

ptomato removed the meeting-agenda label Sep 9, 2024

ptomato added the question label Sep 9, 2024

ptomato closed this as completed Sep 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

is there a specification that Temporal uses to dictate how datetimes are parsed? #2843

is there a specification that Temporal uses to dictate how datetimes are parsed? #2843

BurntSushi commented May 17, 2024 •

edited

Loading

nekevss commented May 18, 2024 •

edited

Loading

BurntSushi commented May 18, 2024 •

edited

Loading

nekevss commented May 18, 2024

BurntSushi commented May 18, 2024

justingrant commented May 20, 2024

BurntSushi commented May 20, 2024 •

edited

Loading

nekevss commented May 20, 2024 •

edited

Loading

sffc commented May 20, 2024

justingrant commented May 20, 2024

nekevss commented May 20, 2024

gibson042 commented May 20, 2024

nekevss commented May 20, 2024

ptomato commented May 20, 2024

gibson042 commented May 20, 2024

justingrant commented May 21, 2024

gibson042 commented May 21, 2024

justingrant commented May 23, 2024

justingrant commented May 23, 2024

sffc commented May 23, 2024 •

edited

Loading

justingrant commented May 23, 2024

ptomato commented May 23, 2024

sffc commented May 24, 2024

justingrant commented May 24, 2024

is there a specification that Temporal uses to dictate how datetimes are parsed? #2843

is there a specification that Temporal uses to dictate how datetimes are parsed? #2843

Comments

BurntSushi commented May 17, 2024 • edited Loading

nekevss commented May 18, 2024 • edited Loading

BurntSushi commented May 18, 2024 • edited Loading

nekevss commented May 18, 2024

BurntSushi commented May 18, 2024

justingrant commented May 20, 2024

BurntSushi commented May 20, 2024 • edited Loading

nekevss commented May 20, 2024 • edited Loading

sffc commented May 20, 2024

justingrant commented May 20, 2024

nekevss commented May 20, 2024

gibson042 commented May 20, 2024

nekevss commented May 20, 2024

ptomato commented May 20, 2024

gibson042 commented May 20, 2024

justingrant commented May 21, 2024

gibson042 commented May 21, 2024

justingrant commented May 23, 2024

justingrant commented May 23, 2024

sffc commented May 23, 2024 • edited Loading

justingrant commented May 23, 2024

ptomato commented May 23, 2024

sffc commented May 24, 2024

justingrant commented May 24, 2024

BurntSushi commented May 17, 2024 •

edited

Loading

nekevss commented May 18, 2024 •

edited

Loading

BurntSushi commented May 18, 2024 •

edited

Loading

BurntSushi commented May 20, 2024 •

edited

Loading

nekevss commented May 20, 2024 •

edited

Loading

sffc commented May 23, 2024 •

edited

Loading