Clarify temporal intervals #331 #394

Merged
merged 9 commits into from
Apr 29, 2023

Member

@m-mohr m-mohr commented Nov 28, 2022

It seems the implementations tend more towards the approach that the two elements in the intervals should not be the same, so I created a PR that proposes this breaking change for v2.0.0.

Fixes #331

…ance in time must be after the first instance in time. #331
Member

@clausmichele clausmichele left a comment

this process allows the value '24' for the hour of an end time in order to make it possible that left-closed time intervals can fully cover the day.

What's the difference in writing an interval like ["2017-01-01T00:00:00Z", "2017-01-02T00:00:00Z"] to define a full 24h day?

Including this particular case that uses T24:00:00Z seems too complicated and no clearer, in my opinion.

@m-mohr
Member Author

m-mohr commented Nov 30, 2022

@clausmichele Good question. I had to think quite a bit about it and I can't remember the exact reasons we had back then. It clearly originates from ISO 8601, but it seems they are also struggling with it:

An amendment was published in October 2022 featuring minor technical clarifications and attempts to remove ambiguities in definitions. The most significant change, however, was the reintroduction of the 24:00:00 format to refer to the instant at the end of a calendar day.

So it was there in the beginning, got removed and then added back again. I assume their reasons for allowing 24 also apply here.

Without knowing their reasoning (ISO standards must be bought 🤮), I found some edge cases where it may matter and surprise users:

  • You want to exactly specify the 28th of February as a day. If you specify it as 2020-02-28T00:00:00Z - 2020-03-01T00:00:00Z you'd actually get two days as the 29th sneaks in due to the leap year.
  • Similarly, it is a bit inconsistent to define 30-day intervals for each month where sometimes you may need to use the first of the next month and sometimes you need to use the 31st (or 28th). It is a bit easier to just use the 30th (ignoring February for now).
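
The leap-year surprise from the first bullet can be checked with a minimal sketch. It assumes Python's standard `datetime` module and whole-second UTC timestamps; `days_covered` is a hypothetical helper, not an openEO function:

```python
from datetime import datetime, timezone

def days_covered(start, end):
    """Length in days of the left-closed interval [start, end)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    t0 = datetime.strptime(start, fmt).replace(tzinfo=timezone.utc)
    t1 = datetime.strptime(end, fmt).replace(tzinfo=timezone.utc)
    return (t1 - t0).total_seconds() / 86400

# Same textual pattern, different length depending on the year.
leap = days_covered("2020-02-28T00:00:00Z", "2020-03-01T00:00:00Z")
non_leap = days_covered("2021-02-28T00:00:00Z", "2021-03-01T00:00:00Z")
print(leap, non_leap)  # 2.0 1.0
```

In 2020 the interval silently includes the 29th, while the same pattern in 2021 covers exactly one day.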

Does anyone have access to the ISO documents? I'd like to read about the 24:00:00 changes because I agree that ideally we could get rid of this special handling.

Contributor

@LukeWeidenwalker LukeWeidenwalker left a comment

LGTM!

Ardweaden previously approved these changes Dec 16, 2022
@clausmichele
Member

@m-mohr were you able to access the ISO documents?

@m-mohr m-mohr requested a review from mkadunc December 21, 2022 14:01
@m-mohr
Member Author

m-mohr commented Dec 21, 2022

Yes! Thanks to @mkadunc for pointing me in the right direction and also providing his point of view.

24 as the hour

T24:00:00 is valid and behaves the same as the next day's T00:00:00 in ISO8601.

But: I just realized that JSON Schema's date-time format explicitly mentions RFC 3339 compliance, which disallows 24. So some schema validators may fail on T24:00:00. We'd either need to remove the format from the schema (and lose automatic validation) or comply and disallow 24.
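
One way a backend could keep accepting legacy T24:00:00 values while staying RFC 3339 compliant internally is to rewrite them before validation. A minimal sketch, assuming whole-second timestamps with an explicit offset; `normalize_hour_24` is a hypothetical helper, not part of any openEO library:

```python
import re
from datetime import date, timedelta

# Matches only the end-of-day form, e.g. 2020-02-28T24:00:00Z
_END_OF_DAY = re.compile(r"^(\d{4}-\d{2}-\d{2})T24:00:00(Z|[+-]\d{2}:\d{2})$")

def normalize_hour_24(ts):
    """Rewrite T24:00:00 as T00:00:00 of the following day."""
    m = _END_OF_DAY.match(ts)
    if m is None:
        return ts  # anything else is passed through unchanged
    next_day = date.fromisoformat(m.group(1)) + timedelta(days=1)
    return f"{next_day.isoformat()}T00:00:00{m.group(2)}"

print(normalize_hour_24("2020-02-28T24:00:00Z"))  # 2020-02-29T00:00:00Z
```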

Incl./excl. upper boundary

The definition of intervals includes the upper boundary, but other parts of the standard allow excluding the upper boundary, so the standard is neither very strict nor very helpful here. To avoid breaking everyone's implementations, we should probably keep excluding it. Otherwise, users could get different results because an additional day would now be included.

Missing parts of a date-time (e.g. a date)

In ISO8601 it seems that omitted time components should simply be replaced by zero (2022-12-14 is basically 2022-12-14T00:00:00).

Conclusion

Well, it's hard to come to a conclusion here as there are so many different aspects to consider, but we likely need to "break" someone's implementation as VITO seems to be contrary to all other implementations. So to come up with a precise definition, someone needs to bite the bullet anyway.

We could go the following route:

  • Keep inclusive lower boundary and exclusive upper boundary
  • Clarify that omitting a time component is equivalent to setting it to 0. Omitting a date component is equivalent to setting it to 1.
  • Disallow 24 as the hour (for backward compatibility, backends can keep the previous behavior instead of throwing an error, but that should likely result in a deprecation warning in the logs).
  • Give more examples and help users to understand what they need to do
  • Disallow that the values for the lower and upper boundary are the same by specifying "uniqueItems": true and, if that doesn't catch it, throw a TemporalExtentEmpty error (for backward compatibility, backends can keep the previous behavior instead of throwing an error, but that should likely result in a deprecation warning in the logs).
  • We may also help in client implementations by just allowing to pass a single string (e.g. "2020-01-01") or a corresponding "Date object", which gets transformed internally into ["2020-01-01T00:00:00Z","2020-01-02T00:00:00Z"].
  • In aggregate_temporal we allow just times (without dates) and there it seems impossible to specify the full range as 23:59:59 would be excluded and 24:00:00 is not allowed. We should point users at null, which allows ["00:00:00", null] to cover the full day.
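
The padding and emptiness rules proposed in this list could look roughly like the sketch below. It assumes only year, year-month, date and whole-second UTC date-time strings; `to_instant`, `normalize_interval` and the Python `TemporalExtentEmpty` class are hypothetical names for illustration, not an existing backend API:

```python
from datetime import datetime, timezone

class TemporalExtentEmpty(ValueError):
    """Hypothetical Python stand-in for the TemporalExtentEmpty error."""

def to_instant(value):
    """Pad omitted components: time parts become 0, date parts become 1."""
    for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d", "%Y-%m", "%Y"):
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unsupported temporal string: {value!r}")

def normalize_interval(interval):
    """Return (lower, upper) for a left-closed interval [lower, upper)."""
    start, end = interval
    lower = to_instant(start) if start is not None else None
    upper = to_instant(end) if end is not None else None
    if lower is not None and upper is not None and upper <= lower:
        raise TemporalExtentEmpty("upper boundary must be later than lower boundary")
    return lower, upper

lower, upper = normalize_interval(["2020-01-01", "2020-01-02"])
```

A backend that wants to stay backward compatible could log a deprecation warning instead of raising, as suggested above.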

@clausmichele
Member

For me it would also be fine to change the API and make the upper boundary included, since it caused many headaches in the past and it would be easier to understand. However, we need to clarify it in the best way.

If we change the definition I see mostly two cases:
Case 1: I provide a complete range like ["2020-01-01T00:00:00Z","2020-01-02T00:00:00Z"]: no surprises, I should know what I am asking for and I will also get data at 2020-01-02T00:00:00Z if there's some. -> all good
Case 2: I provide just the date ["2020-01-01","2020-01-02"]: I would expect to get data for the 2nd of January as well if we say that the upper boundary is included, but instead the range gets converted with zeros and therefore the data on the 2nd is excluded. -> we have to clarify this to the users
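
Case 2 can be made concrete with a small sketch, assuming date-only values are padded with zeros as discussed; `parse` and `selects` are hypothetical helpers written only for this illustration:

```python
from datetime import datetime, timezone

def parse(value):
    """Parse a date or whole-second UTC date-time; dates are padded with zeros."""
    fmt = "%Y-%m-%dT%H:%M:%SZ" if "T" in value else "%Y-%m-%d"
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)

def selects(interval, observation, include_upper):
    """Would an observation be selected by the interval?"""
    lower, upper = (parse(v) for v in interval)
    t = parse(observation)
    return lower <= t <= upper if include_upper else lower <= t < upper

date_only = ["2020-01-01", "2020-01-02"]
noon = "2020-01-02T12:00:00Z"
midnight = "2020-01-02T00:00:00Z"

# Even with an inclusive upper boundary, data from noon on the 2nd is not selected.
noon_inclusive = selects(date_only, noon, include_upper=True)
# Only an observation exactly at midnight depends on the boundary rule.
midnight_inclusive = selects(date_only, midnight, include_upper=True)
midnight_exclusive = selects(date_only, midnight, include_upper=False)
```

So making the upper boundary inclusive only changes the result for data located exactly at the padded instant.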

@m-mohr
Member Author

m-mohr commented Dec 21, 2022

For me it would also be fine to change the API and make the upper boundary included, since it caused many headaches in the past and it would be easier to understand. However, we need to clarify it in the best way.

As said above, I don't think this is a good idea. Changing this would make data unreproducible as the now included day would interfere. Results would change without users noticing it.

@clausmichele
Member

Well, all the users using the VITO back-end wouldn't notice any difference actually, since they already return the upper bound as well. Maybe it's something to discuss in the next dev telco?

@mkadunc
Member

mkadunc commented Dec 21, 2022

* Clarify that omitting a component is equivalent to setting it to 0

Is it possible to define a union type for the date-time, date and year, where we could then describe the relationships? Or would we add this clarification to all the process parameters where this union type is used?

* We may also help in client implementations by just allowing to pass a single string (e.g. "2020-01-01") or a corresponding "Date object", which gets transformed internally into `["2020-01-01T00:00:00Z","2020-01-02T00:00:00Z"]`.

I agree — either overload the process to allow a single parameter, or provide a helper function such as single_day_to_interval("2020-01-01").
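
A client-side helper along these lines might look like the following sketch. `single_day_to_interval` is the hypothetical name from the comment above, and the output assumes UTC and an exclusive upper boundary:

```python
from datetime import date, timedelta

def single_day_to_interval(day):
    """Turn a date string into a left-closed interval covering that whole day."""
    start = date.fromisoformat(day)
    end = start + timedelta(days=1)
    return [f"{start.isoformat()}T00:00:00Z", f"{end.isoformat()}T00:00:00Z"]

print(single_day_to_interval("2020-01-01"))
# ['2020-01-01T00:00:00Z', '2020-01-02T00:00:00Z']
```

Because the end is computed with calendar arithmetic, the helper also avoids the leap-year pitfall for the 28th of February.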

* In `aggregate_temporal` we allow just times (without dates) and there it seems impossible to specify the full range as 23:59:59 would be excluded and 24:00:00 is not allowed. We should point users at `null`, which allows `["00:00:00", null]` to cover the full day.

Another option would be to specify "the full range" with two identical instants - [00:00:00, 00:00:00] would be 24 hours midnight-to-midnight and [06:00:00, 06:00:00] would be 24 hours 6am to 6am. The second example wouldn't be fixed by allowing 24 as the hour. But this is assuming that we want to allow time intervals to span more than one calendar day (and I don't know enough about aggregate_temporal use-cases using only time, so I don't know whether this would be needed).
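
Under this proposed reading (which is only an option raised here, not adopted behavior), the length of a time-only interval could be computed as in the sketch below; `time_interval_length` is a hypothetical helper and assumes whole-second times:

```python
from datetime import time, timedelta

def time_interval_length(start, end):
    """Length of a time-only interval that may wrap past midnight.

    Identical start and end are read as a full 24 hours, not as empty.
    """
    t0, t1 = time.fromisoformat(start), time.fromisoformat(end)
    s0 = t0.hour * 3600 + t0.minute * 60 + t0.second
    s1 = t1.hour * 3600 + t1.minute * 60 + t1.second
    seconds = (s1 - s0) % 86400
    if seconds == 0:
        seconds = 86400
    return timedelta(seconds=seconds)

full_day = time_interval_length("06:00:00", "06:00:00")   # 24 hours
overnight = time_interval_length("22:00:00", "04:00:00")  # 6 hours
```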

@m-mohr
Member Author

m-mohr commented Dec 21, 2022

@clausmichele But there's not only the VITO implementation. We can change a lot here, but I think the exclusive upper bounds are somewhat set in stone for now due to the reproducibility issue...

Is it possible to define a union type for the date-time, date and year, where we could then describe the relationships? Or would we add this clarification to all the process parameters where this union type is used?

@mkadunc We could, but it would change the schemas quite a bit and make them more complex so I'd try to avoid that. But we have the temporal-interval subtype which we can use and then it's "only" the comparison operators which may need a clarifying word in the parameter description, I think. Although, should eq("2018", "2018-01-01T00:00:00Z") return true or false?

Another option would be to specify "the full range" with two identical instants - [00:00:00, 00:00:00]

Hmm... something to look into, the aggregate_temporal spec is quite old and I think there's no actual user (and implementation?) for the time part right now.

@m-mohr
Member Author

m-mohr commented Dec 21, 2022

The more you go through the processes to make changes, the worse it gets.
I'm looking at between right now, which defines min and max parameters, which can be date-time, date or time. min and max are inclusive. So if I specify between("2010", "2020") right now, I'd expect everything from 2010-01-01T00:00:00Z to 2020-12-31T23:59:59Z (both inclusive) to return true, right? If I now add the "missing components are 0 (time)/1 (date)" part, this gets ugly as the "included" part of 2020 is really just the "first instant" (~ the first millisecond) in 2020.
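
The problem can be reproduced with a minimal sketch, assuming the padding rule is applied literally and both bounds are inclusive. `pad` and this Python `between` are hypothetical stand-ins, with `lower`/`upper` used in place of the process parameters min/max:

```python
from datetime import datetime, timezone

def pad(value):
    """Missing time components become 0, missing date components become 1."""
    for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d", "%Y"):
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unsupported temporal string: {value!r}")

def between(x, lower, upper):
    """Inclusive comparison on the padded instants."""
    return pad(lower) <= pad(x) <= pad(upper)

mid_2020 = between("2020-06-15", "2010", "2020")              # False
start_2020 = between("2020-01-01T00:00:00Z", "2010", "2020")  # True
```

A date in the middle of 2020 falls outside the range, although a user would probably expect the whole year to be included.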

…terval subtype from climatological_normal (as it has inclusive upper boundaries)
@mkadunc
Member

mkadunc commented Jan 5, 2023

Although, should eq("2018", "2018-01-01T00:00:00Z") return true or false?

If we want to stick to JSON as the only true context for openEO types, then we have to go with the fact that these are all JSON strings and therefore return false.

If we want, we could define date and time types for openEO in more detail, in which case we could either go with "date- or time-like JSON strings represent intervals", like this:

  • "year" - represents the temporal interval covering a full year in UTC
  • "date" - represents the temporal interval covering a full day in UTC
  • "date-time"
    • a) represents the temporal interval covering a single time unit (the unit is usually second, but could be shorter, e.g. ds, cs, ms etc. in case fractional-second digits are present)
    • b) represents a time instant (the instant at the start of the indicated second, or at the start of a shorter unit in case fractional-seconds are present)

... or we could specify that "all date- or time-like JSON strings represent single instants" (in line with ISO 8601 interpretation), in which case we have:

  • "year": represents the instant at the beginning of the specified year, in UTC
  • "date": represents the instant at the beginning of the specified day, in UTC
  • "date-time": represents a time instant (the instant at the start of the indicated second, or at the start of a shorter unit in case fractional-seconds are present)

Intuitively, the first option would be more correct, but it does add some complexity to the whole system (all date- and time-like parameters would be intervals rather than single scalars)
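
To compare the readings, here is a rough sketch of how eq("2018", "2018-01-01T00:00:00Z") would come out under each. It assumes only year, date and whole-second UTC date-time strings, and uses variant a) for date-time in the interval reading; `as_interval` and `as_instant` are hypothetical helpers:

```python
from datetime import datetime, timedelta, timezone

def as_interval(value):
    """Interval reading: the half-open span [start, end) a string covers."""
    if len(value) == 4:  # "year"
        start = datetime(int(value), 1, 1, tzinfo=timezone.utc)
        return start, start.replace(year=start.year + 1)
    if len(value) == 10:  # "date"
        start = datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        return start, start + timedelta(days=1)
    # "date-time", variant a): covers one second
    start = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return start, start + timedelta(seconds=1)

def as_instant(value):
    """Instant reading: only the beginning of the indicated period counts."""
    return as_interval(value)[0]

a, b = "2018", "2018-01-01T00:00:00Z"
results = {
    "json_strings": a == b,                         # False
    "intervals": as_interval(a) == as_interval(b),  # False
    "instants": as_instant(a) == as_instant(b),     # True
}
```

Only the instant reading makes the two values equal, which is the trade-off described above.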

@m-mohr m-mohr dismissed stale reviews from Ardweaden and LukeWeidenwalker February 1, 2023 17:19

The base branch was changed.

@m-mohr m-mohr marked this pull request as draft March 10, 2023 16:24
# Conflicts:
#	proposals/load_result.json
@m-mohr m-mohr requested review from jdries and aljacob March 31, 2023 15:29
@m-mohr m-mohr removed their assignment Mar 31, 2023
@m-mohr m-mohr marked this pull request as ready for review March 31, 2023 15:29
@m-mohr
Member Author

m-mohr commented Mar 31, 2023

Ready for review!

CHANGELOG.md (outdated, resolved)
aggregate_temporal.json (outdated, resolved)
aggregate_temporal.json (resolved)
aggregate_temporal.json (outdated, resolved)
filter_temporal.json (resolved)
@@ -112,31 +112,27 @@
},
{
"name": "temporal_extent",
"description": "Limits the data to load from the collection to the specified left-closed temporal interval. Applies to all temporal dimensions. The interval has to be specified as an array with exactly two elements:\n\n1. The first element is the start of the temporal interval. The specified instance in time is **included** in the interval.\n2. The second element is the end of the temporal interval. The specified instance in time is **excluded** from the interval.\n\nThe specified temporal strings follow [RFC 3339](https://www.rfc-editor.org/rfc/rfc3339.html). Also supports open intervals by setting one of the boundaries to `null`, but never both.\n\nSet this parameter to `null` to set no limit for the temporal extent. Be careful with this when loading large datasets! It is recommended to use this parameter instead of using ``filter_temporal()`` directly after loading unbounded data.",
"description": "Limits the data to load from the collection to the specified left-closed temporal interval. Applies to all temporal dimensions. The interval has to be specified as an array with exactly two elements:\n\n1. The first element is the start of the temporal interval. The specified time instant is **included** in the interval.\n2. The second element is the end of the temporal interval. The specified time instant is **excluded** from the interval.\n\nThe second element must always be greater/later than the first element. Otherwise, a `TemporalExtentEmpty` exception is thrown.\n\nAlso supports unbounded intervals by setting one of the boundaries to `null`, but never both.\n\nSet this parameter to `null` to set no limit for the temporal extent. Be careful with this when loading large datasets! It is recommended to use this parameter instead of using ``filter_temporal()`` directly after loading unbounded data.",
Member

supports unbounded intervals by setting one of the boundaries to null, but never both

Why forbid setting both ends to null?

Member Author

That's the same as just setting the whole parameter to null. Why have two ways to do the same thing?
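
The null rules from the new temporal_extent description could be checked roughly as follows. This is a sketch that assumes the boundaries are already parsed into `datetime` objects (or `None` for null); `check_temporal_extent` and the Python `TemporalExtentEmpty` class are hypothetical names:

```python
from datetime import datetime, timezone

class TemporalExtentEmpty(ValueError):
    """Hypothetical stand-in for the TemporalExtentEmpty exception."""

def check_temporal_extent(extent):
    """Return a short label describing an already-parsed temporal_extent."""
    if extent is None:
        return "no limit"  # the whole parameter is null
    if len(extent) != 2:
        raise ValueError("temporal_extent needs exactly two elements")
    start, end = extent
    if start is None and end is None:
        raise ValueError("use null for the whole parameter instead of [null, null]")
    if start is None or end is None:
        return "unbounded on one side"
    if end <= start:
        raise TemporalExtentEmpty("the end must be later than the start")
    return "bounded"

jan1 = datetime(2020, 1, 1, tzinfo=timezone.utc)
jan2 = datetime(2020, 1, 2, tzinfo=timezone.utc)
print(check_temporal_extent([jan1, None]))  # unbounded on one side
```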

load_collection.json (outdated, resolved)
meta/subtype-schemas.json (resolved)
{
"id": "date_between",
"summary": "Between comparison for dates and times",
"description": "By default, this process checks whether `x` is later than or equal to `min` and before or equal to `max`.\n\nIf `exclude_max` is set to `true` the upper bound is excluded so that the process checks whether `x` is later than or equal to `min` and before `max`.\n\nLower and upper bounds are not allowed to be swapped. So `min` MUST be before or equal to `max` or otherwise the process always returns `false`.",
Member

what about time-only intervals crossing midnight?

between("01:00:00", min="22:00:00", max="04:00:00"): invalid, false or true?

Member Author

I'd say invalid due to

Lower and upper bounds are not allowed to be swapped. So min MUST be before or equal to max or otherwise the process always returns false.
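
Taking the quoted description literally, the midnight-crossing case would simply return false. A minimal sketch for time-only values, assuming whole-second times; this Python `date_between` is a hypothetical stand-in for the process, with `lower`/`upper` used in place of min/max:

```python
from datetime import time

def date_between(x, lower, upper, exclude_max=False):
    """Time-only comparison following the quoted description."""
    t, lo, hi = (time.fromisoformat(v) for v in (x, lower, upper))
    if lo > hi:
        return False  # swapped bounds: the process always returns false
    return lo <= t < hi if exclude_max else lo <= t <= hi

crossing = date_between("01:00:00", "22:00:00", "04:00:00")  # False
```

Whether a backend should instead reject such input as invalid is exactly the open question in this thread.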

proposals/load_stac.json (outdated, resolved)
@m-mohr m-mohr requested a review from soxofaan April 5, 2023 09:46
@m-mohr
Member Author

m-mohr commented Apr 18, 2023

Does anyone have the capacity to review this? @dthiex @soxofaan @clausmichele @mkadunc @LukeWeidenwalker
Thanks!

@m-mohr m-mohr merged commit 39cb6ba into draft Apr 29, 2023
@m-mohr m-mohr deleted the issue-331 branch April 29, 2023 17:03
Development

Successfully merging this pull request may close these issues.

load_collection: Clarification on temporal_extent
6 participants