
OData filter: date/time fields more precise than millisecond #459

Closed
matthew-white opened this issue Mar 1, 2022 · 11 comments · Fixed by #716

@matthew-white
Member

We are seeing cases where filtering OData does not exclude all the submissions it's expected to. @lognaturel observed while using the gt operator:

If my last update timestamp is 2022-02-23T23:28:50.058Z, it includes a submission made at timestamp 2022-02-23T23:28:50.058Z. If I use 2022-02-23T23:28:50.059Z in the query, that submission is excluded.

I think this is because Postgres timestamp columns are more precise than millisecond. This also came up in #358 (which led to b15086d).

This comes up when fetching any submissions created since the latest known submission. One way to handle this is to ignore duplicate submissions. If that's not possible, 1ms could be added to the latest timestamp.

We could attempt to fix this in the code, probably by truncating timestamp columns to millisecond. We could truncate a timestamp column in queries in which it is used to filter, though we would probably have to add an index for that. We could also consider truncating the timestamp when it is stored: it looks like the precision of Postgres timestamp columns can be configured as part of the data type of the column (docs).
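For concreteness, the stored-precision option could look roughly like the sketch below. This is not from the discussion: the column is assumed to be timestamptz, and a slonik-style `sql` tag is assumed based on the query quoted later in this thread.

```js
// Sketch only: reduce submissions."createdAt" to millisecond precision by
// changing the column's data type. Note that timestamptz(3) rounds existing
// values to the nearest millisecond rather than truncating them.
const { sql } = require('slonik');

const reduceCreatedAtPrecision = (db) =>
  db.query(sql`ALTER TABLE submissions ALTER COLUMN "createdAt" TYPE timestamptz(3)`);
```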

Alternatively, perhaps we don't change the code and simply document the issue in the API docs, along with the different strategies to handle it.

@sadiqkhoja
Contributor

Since the Node.js / JavaScript Date object has just millisecond resolution, the best route to take would be to change the precision of the timestamps in the database. I suspect there are several issues (known and unknown) across the application related to timestamps, so I propose we change the precision of all timestamp columns across the database. What do you think? @matthew-white
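As a rough way to gauge the scope of changing every timestamp column, one could list them from the information schema. A sketch only, not from the thread; the slonik-style `sql` tag and `db.any` helper are assumptions.

```js
// Sketch: list every timestamp column in the public schema, along with its
// current fractional-second precision, to see what "all timestamp columns
// across the database" would cover.
const { sql } = require('slonik');

const listTimestampColumns = (db) => db.any(sql`
  SELECT table_name, column_name, data_type, datetime_precision
    FROM information_schema.columns
   WHERE table_schema = 'public'
     AND data_type IN ('timestamp with time zone', 'timestamp without time zone')
   ORDER BY table_name, column_name`);
```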

@issa-tseng
Member

it's a possibility. it'll increase the number of timestamp ties we get, which is surprisingly already a problem. is it too slow to truncate on query, and too slow to index the truncation?

@sadiqkhoja
Contributor

sadiqkhoja commented Apr 18, 2022 via email

@matthew-white
Member Author

matthew-white commented Apr 18, 2022

Can you please explain "timestamp ties"? Where is that happening?

For example, in our main submissions query (which we use to export submissions), we sort not just by submissions."createdAt" but also by submissions.id:

order by submissions."createdAt" desc, submissions.id desc

We do that because, like @issa-tseng said, we've seen submissions that somehow have the exact same timestamp, so we need to sort by a second column in order to guarantee a stable order.

Since the Node.js / JavaScript Date object has just millisecond resolution, the best route to take would be to change the precision of the timestamps in the database.

I think you're right that when we retrieve a timestamp value from the database, we discard everything after millisecond:

// Database type parser for timestamp values: new Date(x) keeps at most
// millisecond precision, so anything finer from Postgres is dropped here.
const timestampTypeParser = { name: 'timestamp', parse: (x) => new Date(x) };

When Postgres uses timestamp values, including for filtering and sorting, they will be more precise, but once they reach Node, they will have millisecond precision. That also means that those values will have millisecond precision in the API response.
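For example, in plain Node (nothing project-specific here):

```js
// Date stores milliseconds since the epoch, so the sub-millisecond digits
// coming back from Postgres are dropped as soon as the value is parsed.
const parsed = new Date('2022-02-23T23:28:50.058456Z');
console.log(parsed.toISOString()); // 2022-02-23T23:28:50.058Z
```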

For what it's worth, it looks like the OData Edm.DateTimeOffset type supports values more precise than millisecond, so with some effort, we could probably use a greater amount of the precision that Postgres provides. That would allow the user to $filter more precisely. However, we haven't encountered a need for greater precision, so I think it should work alright to continue returning millisecond-precise timestamps over the API.

I suspect there are several issues (known/unknown) across the application related to timestamp, I propose we change precision of all timestamp columns across the database

In addition to the OData filter, I can think of one other place where this comes up. We allow the user to filter the server audit log using timestamps. If someone tries to poll the audit log for entries more recent than the most recent audit log entry they've seen, they could run into a similar issue.

In general, the fact that we use different levels of precision in Postgres vs. Node does seem like the sort of thing that could lead to subtle issues. Then again, I can't think of many places where we use timestamps to drive business logic.

For some reason, I feel unsure about changing all timestamp columns in the database so that the timestamp data type is less precise. Part of me also wonders whether the additional precision on audits."loggedAt" might somehow facilitate debugging (though I guess that's not particularly likely). But all things considered, I think I'd lean toward changing the data type of submissions."createdAt", submissions."updatedAt", and audits."loggedAt" — that feels like the easiest thing to implement and maintain. @issa-tseng, what do you think of that approach?

@matthew-white
Member Author

matthew-white commented Apr 20, 2022

@lognaturel, @ktuite, and I discussed this issue today, and I think there's a little discomfort with the idea of reducing the precision of the timestamps in the database. I feel this discomfort myself, though I could probably be convinced out of it: given that we can also sort by id, I'm not actually sure we lose anything by having less precision in the database.

While it'd be great to retain the precision in the database, it's also the case that it'd be nice to have a consistent level of precision throughout the application: in the database, in Node, and in API responses. Reducing the level of precision in the database to millisecond would automatically result in a consistent level of precision. But we're wondering whether there might be a way without a very high level of effort to do the reverse: to increase the amount of precision that's used in Node and offered over the API.

Like @sadiqkhoja said, Date only has millisecond precision, but maybe we could find another mechanism or package that offers greater precision. Just as an example, https://github.com/wadey/node-microtime is one repository that @lognaturel came across. Probably the ideal would be to find a package with a Date-like API, where the hope would be that we would be able to use those Date-like objects without changing much application code. For example, @google-cloud/precise-date seems to offer a class that extends Date and offers nanosecond precision (9 digits). Our OData parser seems to support up to picosecond precision (12 digits), so there shouldn't be any constraint there.

If we take this approach, we'd have to make sure that Postgres isn't more precise than the Node package we're using. For example, if we used a package with nanosecond precision, but Postgres is more precise than nanosecond, we'd continue to see this issue. Taking a quick look at the Postgres docs, I can't tell what the maximum level of precision is:

time, timestamp, and interval accept an optional precision value p which specifies the number of fractional digits retained in the seconds field. By default, there is no explicit bound on precision. The allowed range of p is from 0 to 6 for the timestamp and interval types.

In any case, I think the next step is to spend just a little time evaluating whether this is a viable approach. Maybe it'd be easy to substitute in an alternative to Date. Or maybe doing so would require changing application code or result in surprises. (For example, given that the Google package above doesn't seem to override valueOf(), I suspect that comparisons would only happen at the millisecond level.) Either way, it'd be useful to learn more!
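To make the valueOf() concern concrete, here is a toy illustration. This is a hypothetical subclass, not @google-cloud/precise-date itself.

```js
// Hypothetical Date-like class that carries extra nanoseconds but does not
// fold them into valueOf(). The < and > operators coerce through valueOf(),
// so two values that differ only below the millisecond compare as equal.
class NanoDate extends Date {
  constructor(iso, nanosPastMillisecond = 0) {
    super(iso);
    this.nanosPastMillisecond = nanosPastMillisecond;
  }
}

const a = new NanoDate('2022-02-23T23:28:50.058Z', 456000); // .058456
const b = new NanoDate('2022-02-23T23:28:50.058Z', 460000); // .058460
console.log(a < b);                       // false: compared at millisecond level
console.log(a.getTime() === b.getTime()); // true
```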

@sadiqkhoja
Contributor

I will explore options to increase precision in Node, including Google's precise-date. But I would like to understand the business value of a more precise date. When I think about it from a very high level, I don't see why we would need a precise timestamp: data is collected offline and uploaded later, so createdAt is not the collection time but the insertion time.

From a technical perspective, if we want consistent ordering, then id is the field that should be used, because no matter how precise a timestamp we keep, there is a practical possibility that two submissions occur at the same time.

Thinking about the original use case: if someone wants to see submissions that came after the last seen submission, then they should use the id column instead of createdAt, because using createdAt can also cause data to be skipped, not just duplicated, even when Node and PostgreSQL have the same precision. For example:

Let's say the timestamp of the last known submission is 2022-02-23T23:28:50.058456Z, and then the following events occur concurrently (the timing of the events is the same at microsecond precision, but there is an order at higher precision, i.e. picosecond):

  • a new submission is inserted with createdAt 2022-02-23T23:28:50.058460Z
  • someone queries the data with createdAt gt 2022-02-23T23:28:50.058456Z -- gets only one submission
  • a new submission is inserted with createdAt 2022-02-23T23:28:50.058460Z

Later, the user queries the data with createdAt gt 2022-02-23T23:28:50.058460Z, which returns no submissions, so the second submission is missed.

@issa-tseng
Member

issa-tseng commented Apr 20, 2022

would you also then return the date at greater precision over the api? i don't think iso supports that.

edit: missed @sadiqkhoja's comment and i generally agree that i'm not sure what the value of a greater precision is for the perf loss.

@matthew-white
Member Author

Let's say the timestamp of the last known submission is 2022-02-23T23:28:50.058456Z, and then the following events occur concurrently (the timing of the events is the same at microsecond precision, but there is an order at higher precision, i.e. picosecond):

  • a new submission is inserted with createdAt 2022-02-23T23:28:50.058460Z

  • someone queries the data with createdAt gt 2022-02-23T23:28:50.058456Z -- gets only one submission

  • a new submission is inserted with createdAt 2022-02-23T23:28:50.058460Z

Later, the user queries the data with createdAt gt 2022-02-23T23:28:50.058460Z, which returns no submissions, so the second submission is missed.

Very interesting! I hadn't thought of this case. I'll note that if this only comes up when there are ties on createdAt, those ties will be less likely if createdAt is more precise. But I agree that even if we have createdAt be more precise, this remains a possibility.

From a technical perspective, if we want consistent ordering, then id is the field that should be used, because no matter how precise a timestamp we keep, there is a practical possibility that two submissions occur at the same time.

Thinking about the original use case: if someone wants to see submissions that came after the last seen submission, then they should use the id column instead of createdAt

As an aside, you'd think in some cases that simple paging would help with this issue (keeping track of the offset so far). However, the OData feed returns newer submissions first, so just using an offset won't work. Similarly, /v1/audits returns newer audit log entries first.

I've noticed that GitHub seems to do something similar to what you're suggesting. For example, if you view the list of commits, newer commits are shown first, and when you go to the next page, there's an ?after query parameter that seems to specify an anchoring commit along with an offset.

For what it's worth, a mechanism like that would maybe simplify the SubmissionList component in Frontend, which just uses limit/offset ($top/$skip) and so has to have a bit of logic to account for the case where one user is scrolling through the table while another user is submitting new submissions.

It'd be useful to find out whether there's a standard mechanism like this within OData. In general, we just use a subset of OData query parameters with the OData feed: $top, $skip, $filter, and $count.
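For illustration, the server-side query behind a GitHub-style ?after anchor could look roughly like the sketch below. This is only a sketch of keyset paging over the existing sort order, not an existing query in the codebase; the parameter names and the slonik-style `sql` tag are assumptions.

```js
// Sketch: return the page that comes strictly after the last row the client
// has seen, using the same sort order as the main submissions query
// (createdAt desc, id desc). lastCreatedAt is an ISO-8601 string here.
const { sql } = require('slonik');

const pageAfter = (db, lastCreatedAt, lastId, pageSize) => db.any(sql`
  SELECT * FROM submissions
   WHERE ("createdAt", id) < (${lastCreatedAt}::timestamptz, ${lastId})
   ORDER BY "createdAt" DESC, id DESC
   LIMIT ${pageSize}`);
```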

By the way, note that we don't expose submissions.id over the API. Instead, for a particular form, submissions are identified by their instance ID.

@sadiqkhoja: When I think about it from a very high level, I don't see why we would need a precise timestamp: data is collected offline and uploaded later, so createdAt is not the collection time but the insertion time.

@issa-tseng: i generally agree that i'm not sure what the value of a greater precision is for the perf loss.

I think I'm also coming around to the idea that there's really no harm in reducing precision.

Even if we end up deciding to introduce a more robust way to fetch the latest submissions along the lines of @sadiqkhoja's suggestion, I still think it'd be useful to change the current behavior around filtering by createdAt, because it's a little hard to explain. But we should probably also document @sadiqkhoja's example in the API docs. Then again, if we do decide to solve this in a different way, maybe the right answer for now really is just to document all this in the API docs and not make other changes.

Just an idea: one strategy we could suggest is to use gte rather than gt, but keep track of the IDs associated with the latest submissions. Those IDs would allow the user to filter out submissions already seen while also accounting for @sadiqkhoja's case. Some use cases (including @lognaturel's I think) won't keep track of all instance IDs seen so far, but maybe it's always possible to keep track of just the latest IDs.
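A sketch of what that strategy could look like from the client's point of view (OData spells the operator ge; the field name, URL shape, and global fetch are assumptions here, not documented guarantees):

```js
// Sketch: poll for submissions at or after the newest timestamp seen so far,
// then drop the ones whose instance IDs (__id in the OData feed) have already
// been processed. Updating lastSeen and seenIds afterwards is left out.
const pollNewSubmissions = async (serviceUrl, token, lastSeen, seenIds) => {
  const filter = encodeURIComponent(`__system/submissionDate ge ${lastSeen}`);
  const response = await fetch(`${serviceUrl}/Submissions?$filter=${filter}`, {
    headers: { Authorization: `Bearer ${token}` }
  });
  const { value } = await response.json();
  return value.filter((submission) => !seenIds.has(submission.__id));
};
```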

@issa-tseng: would you also then return the date at greater precision over the api? i don't think iso supports that.

I was wondering about that. I poked around just a little and didn't see anything explicit about this, though I didn't look for long. Date.parse(), Luxon, Postgres, and OData all seem to be happy parsing timestamps with greater precision, but it does seem possible that we'd break a downstream tool or workflow by changing the precision returned over the API.

@matthew-white
Member Author

I'm still thinking that it'd be helpful to change the current behavior around filtering by timestamp. At some point, we may want to explore adding an alternative way to page through submissions, especially if there's a standard mechanism within OData along those lines. However, until then, I'm hopeful that making a change to the current timestamp filtering actually wouldn't require a large change to the code.

Last week, @issa-tseng, @ktuite, and I discussed what that change would look like, and there's still a preference for keeping the timestamp columns at their current level of precision. That would mean truncating timestamp columns to millisecond where they are used to filter, using something like date_trunc(). I believe the only timestamp columns we use to filter are submissions."createdAt", submissions."updatedAt", and audits."loggedAt".
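For concreteness, the filter-side truncation could look something like the sketch below. This is not an existing query; the `sql` tag is used in the slonik style of the snippet quoted further down, and the helper name is hypothetical.

```js
// Sketch: truncate the column to milliseconds before comparing it to the
// client-supplied bound. For this to use an index, a matching expression index
// would be needed; because date_trunc() on a timestamptz is only STABLE, the
// indexed expression would likely have to convert to UTC first, e.g.
// date_trunc('milliseconds', "createdAt" AT TIME ZONE 'UTC'), to be IMMUTABLE.
const { sql } = require('slonik');

const createdAfter = (db, bound) => db.any(sql`
  SELECT * FROM submissions
   WHERE date_trunc('milliseconds', "createdAt") > ${bound}::timestamptz
   ORDER BY "createdAt" DESC, id DESC`);
```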

A big part of that work would involve looking at indexes that reference those columns. We'll probably need to update those indexes (or add new indexes) so that they also truncate. We'd need to make sure that queries that rely on those indexes are still able to leverage them: for example, we may need to truncate those columns in some ORDER BY clauses as well. Note that these columns might be used in unexpected places. For example, audits."loggedAt" is used to determine when an app user was last used:

const _get = extender(FieldKey, Actor, Frame.define('token', readable))(Actor.alias('created_by', 'createdBy'), Frame.define('lastUsed', readable))((fields, extend, options) => sql`
select ${fields} from field_keys
join actors on field_keys."actorId"=actors.id
left outer join sessions on field_keys."actorId"=sessions."actorId"
${extend|| sql`join actors as created_by on field_keys."createdBy"=created_by.id`}
${extend|| sql`left outer join
(select "actorId", max("loggedAt") as "lastUsed" from audits
where action='submission.create'
group by "actorId") as last_usage
on last_usage."actorId"=actors.id`}
where ${equals(options.condition)} and actors."deletedAt" is null
order by (sessions.token is not null) desc, actors."createdAt" desc`);

@issa-tseng, were you thinking that we would be updating indexes or adding new ones?

If truncating these columns within the query ends up involving a fair amount of complexity, I think we'd be open to alternative approaches, including reducing the level of precision of these three columns within the database. However, I think we'd first like to see what truncating within the query would entail.

I also wanted to note that we're thinking that there shouldn't be an issue related to the precision mismatch between Postgres and Node. Postgres does most of the work around these timestamp columns. And if Postgres truncates in the query, and Node uses those truncated values, there shouldn't be any problematic inconsistency.


@lognaturel
Member

I realize I didn't see the latest couple messages here and have lost track of status!

I think I made a comment at some point that maybe we should use an opaque cursor and maybe we should wait until we solve the problem for entities. But @matthew-white brought this up again today and I remembered that it is causing pain now. If there's something reasonable we can do in the short term, it would certainly provide value.

truncating timestamp columns to millisecond where they are used to filter, using something like date_trunc()

Was this explored?

@matthew-white matthew-white added this to the v2023.1 milestone Nov 16, 2022
@sadiqkhoja sadiqkhoja self-assigned this Dec 8, 2022
sadiqkhoja added commits to sadiqkhoja/central-backend that referenced this issue Dec 8, Dec 9, Dec 16, and Dec 19, 2022
@matthew-white matthew-white moved this to ✅ done in ODK Central Dec 20, 2022
@matthew-white matthew-white removed this from the v2023.1 milestone Jan 24, 2023
sadiqkhoja added a commit that referenced this issue Jan 24, 2023
* fix #459: use for-loop instead of Promise.all in migration

* using SQL group by instead