OData filter: date/time fields more precise than millisecond #459
Comments
Since nodejs / javascript
it's a possibility. it'll increase the number of timestamp ties we get, which is surprisingly already a problem. is it too slow to truncate on query, and too slow to index the truncation? |
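For illustration, "truncate on query" could look something like the sketch below, assuming slonik's sql tag and hypothetical table/column names; this is not the project's actual query.

```js
const { sql } = require('slonik');

// The client can only express millisecond precision, so truncate the stored
// value before comparing; without this, a row stored as ...00.123456+00 still
// compares as greater than a filter value of ...00.123.
const newerThan = (afterTimestamp) => sql`
  SELECT *
  FROM submissions
  WHERE date_trunc('milliseconds', "createdAt") > ${afterTimestamp}
  ORDER BY "createdAt" DESC`;
```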
Can you please explain "timestamp ties"? Where is that happening?
For example, in our main submissions query (which we use to export submissions), we don't sort by the submission timestamp alone. Like @issa-tseng said, we've seen submissions that somehow have the exact same timestamp, so we need to sort by a second column in order to guarantee a stable order.
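A minimal sketch of what such a tiebreaker sort can look like, assuming slonik's sql tag and hypothetical table/column names ("submissions", "createdAt", id); the actual query in the codebase may differ:

```js
const { sql } = require('slonik');

// Ties on "createdAt" are broken by the second column, so the export order
// stays stable even when two submissions share the exact same timestamp.
const orderedSubmissions = sql`
  SELECT *
  FROM submissions
  ORDER BY "createdAt" DESC, id DESC`;
```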
I think you're right that when we retrieve a timestamp, it's truncated to millisecond precision (see central-backend/lib/external/slonik.js, line 13 at ac515ec).

When Postgres uses

For what it's worth, it looks like the OData
In addition to the OData filter, I can think of one other place where this comes up. We allow the user to filter the server audit log using timestamps. If someone tries to poll the audit log for entries more recent than the most recent audit log entry they've seen, they could run into a similar issue.

In general, the fact that we use different levels of precision in Postgres vs. Node does seem like the sort of thing that could lead to subtle issues. Then again, I can't think of many places where we use timestamps to drive business logic. For some reason, I feel unsure about changing all timestamp columns in the database so that they're only precise to the millisecond.
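As a small illustration of that mismatch (a sketch with a made-up value, not code from the project): JavaScript's built-in Date keeps only milliseconds, so the microsecond digits Postgres can store are dropped on the way into Node and the API.

```js
const fromPostgres = '2022-04-17T13:46:00.123456+00:00'; // microsecond precision, as Postgres can store it
const inNode = new Date(fromPostgres);                    // V8 parses the string but keeps only milliseconds

console.log(inNode.toISOString()); // 2022-04-17T13:46:00.123Z

// A follow-up OData filter of "gt 2022-04-17T13:46:00.123Z" still matches the
// original row, because .123456 compares as greater than .123 in Postgres.
```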
@lognaturel, @ktuite, and I discussed this issue today, and I think there's a little discomfort with the idea of reducing the precision of the timestamps in the database. I feel this discomfort myself, though I could probably be convinced out of it, given that we can also sort by a second column.

While it'd be great to retain the precision in the database, it's also the case that it'd be nice to have a consistent level of precision throughout the application: in the database, in Node, and in API responses. Reducing the level of precision in the database to millisecond would automatically result in a consistent level of precision. But we're wondering whether there might be a way, without a very high level of effort, to do the reverse: to increase the amount of precision that's used in Node and offered over the API.

Like @sadiqkhoja said, if we take this approach, we'd have to make sure that Postgres isn't more precise than the Node package we're using. For example, if we used a package with nanosecond precision, but Postgres is more precise than nanosecond, we'd continue to see this issue. Taking a quick look at the Postgres docs, I can't tell what the maximum level of precision is.
In any case, I think the next step is to spend just a little time evaluating whether this is a viable approach. Maybe it'd be easy to substitute in an alternative to the built-in Date.
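For concreteness, here is a rough sketch of what one such alternative could look like, using @google-cloud/precise-date (the package mentioned in the next comment). The method names follow that package's documented API and should be double-checked; nothing below reflects how the codebase works today.

```js
const { PreciseDate } = require('@google-cloud/precise-date');

// PreciseDate can carry the microsecond/nanosecond digits that the built-in
// Date silently drops when parsing a timestamp string.
const ts = new PreciseDate('2022-04-17T13:46:00.123456789Z');

console.log(ts.getFullTimeString()); // nanoseconds since the epoch, as a string
console.log(ts.getMicroseconds());   // 456
console.log(ts.getNanoseconds());    // 789
```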
I will explore options to increase precision in Node, including Google's precise-date. But I would like to understand the business value of a more precise date. When I think about it from a very high level, I don't see why we would need to have such a precise timestamp: data is collected offline and uploaded later on, so the timestamp doesn't say much about when the data was actually collected anyway.

From a technical perspective, if we want consistent ordering, then we already have the second sort column. Thinking about the original use case: if someone wants to see submissions that came after the last seen submission, then they should use something more robust than a strict timestamp comparison.

Let's say the timestamp of the last known submission is stored with microsecond digits but was returned over the API at millisecond precision. Later, the user queries the data with a gt filter on that millisecond value, and the same submission is returned again, because its stored timestamp still compares as greater.
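To make that polling scenario concrete, here is a rough client-side sketch of the "ignore duplicate submissions" strategy from the issue description. The project/form identifiers are made up, it assumes the ge operator and the __system/submissionDate path are accepted by Central's OData filter, and authentication is omitted.

```js
const seen = new Set();                    // __id values already processed
let lastSeen = '2022-04-17T13:46:00.123Z'; // millisecond-precision timestamp from the API

const poll = async () => {
  const filter = `__system/submissionDate ge ${lastSeen}`;
  const url = '/v1/projects/1/forms/my-form.svc/Submissions'
    + `?$filter=${encodeURIComponent(filter)}`;

  const { value } = await (await fetch(url)).json(); // auth headers omitted

  for (const submission of value) {
    if (seen.has(submission.__id)) continue; // duplicate at the boundary timestamp
    seen.add(submission.__id);
    lastSeen = submission.__system.submissionDate;
    // ...process the new submission...
  }
};
```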
would you also then return the date at greater precision over the api? i don't think iso supports that.

edit: missed @sadiqkhoja's comment and i generally agree that i'm not sure what the value of a greater precision is for the perf loss.
Very interesting! I hadn't thought of this case. I'll note that if this only comes up when there are ties on the timestamp

As an aside, you'd think in some cases that simple paging would help with this issue (keeping track of the offset so far). However, the OData feed returns newer submissions first, so just using an offset won't work. Similarly, /v1/audits returns newer audit log entries first.

I've noticed that GitHub seems to do something similar to what you're suggesting. For example, if you view the list of commits, newer commits are shown first, and when you go to the next page, there's an after query parameter that carries a cursor.

For what it's worth, a mechanism like that would maybe simplify the

It'd be useful to find out whether there's a standard mechanism like this within OData. In general, we just use a subset of OData query parameters with the OData feed.

By the way, note that we don't expose
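As one illustration of what an after-style cursor could look like on the backend (purely a sketch, not something Central implements, with hypothetical table/column names and slonik's sql tag), keyset pagination compares against the last row the client saw rather than using an offset:

```js
const { sql } = require('slonik');

// The client passes back the "createdAt" and id of the last submission it
// received; the row-value comparison resumes exactly after that row, even
// when several rows share the same timestamp.
const nextPage = (lastCreatedAt, lastId) => sql`
  SELECT *
  FROM submissions
  WHERE ("createdAt", id) < (${lastCreatedAt}, ${lastId})
  ORDER BY "createdAt" DESC, id DESC
  LIMIT 250`;
```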
I think I'm also coming around to the idea that there's really no harm in reducing precision. Even if we end up deciding to introduce a more robust way to fetch the latest submissions along the lines of @sadiqkhoja's suggestion, I still think it'd be useful to change the current behavior around filtering by timestamp.

Just an idea: one strategy we could suggest is to use
I was wondering about that. I poked around just a little and didn't see anything explicit about this, though I didn't look for long.
I'm still thinking that it'd be helpful to change the current behavior around filtering by timestamp. At some point, we may want to explore adding an alternative way to page through submissions, especially if there's a standard mechanism within OData along those lines. However, until then, I'm hopeful that making a change to the current timestamp filtering actually wouldn't require a large change to the code.

Last week, @issa-tseng, @ktuite, and I discussed what that change would look like, and there's still a preference for keeping the timestamp columns at their current level of precision. That would mean truncating timestamp columns to millisecond where they are used to filter, using something like date_trunc.

A big part of that work would involve looking at indexes that reference those columns. We'll probably need to update those indexes (or add new indexes) so that they also truncate. We'd need to make sure that queries that rely on those indexes are still able to leverage them: for example, we may need to truncate those columns in some queries (see central-backend/lib/model/query/field-keys.js, lines 23 to 34 at c5b0bba). @issa-tseng, were you thinking that we would be updating indexes or adding new ones?

If truncating these columns within the query ends up involving a fair amount of complexity, I think we'd be open to alternative approaches, including reducing the level of precision of these three columns within the database. However, I think we'd first like to see what truncating within the query would entail.

I also wanted to note that we're thinking that there shouldn't be an issue related to the precision mismatch between Postgres and Node. Postgres does most of the work around these timestamp columns. And if Postgres truncates in the query, and Node uses those truncated values, there shouldn't be any problematic inconsistency.
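A rough sketch of what truncate-on-query plus a matching index could look like, again with hypothetical table/column names and slonik's sql tag. One likely wrinkle: date_trunc() over a timestamptz column is not marked IMMUTABLE, so the indexed expression may need an explicit time zone conversion, as below.

```js
const { sql } = require('slonik');

// Expression index over the millisecond-truncated value (converted to a plain
// timestamp at an explicit zone so the expression can be indexed).
const createIndex = sql`
  CREATE INDEX submissions_created_at_ms ON submissions
    (date_trunc('milliseconds', "createdAt" AT TIME ZONE 'UTC'))`;

// The filter must use the exact same expression for the planner to be able to
// use that index.
const newerThan = (after) => sql`
  SELECT *
  FROM submissions
  WHERE date_trunc('milliseconds', "createdAt" AT TIME ZONE 'UTC')
        > (${after}::timestamptz AT TIME ZONE 'UTC')`;
```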
This issue was mentioned on the forum here: https://forum.getodk.org/t/odk-central-api-dates-comparison-seems-to-not-working-properly/38900
I realize I didn't see the latest couple messages here and have lost track of status! I think I made a comment at some point that maybe we should use an opaque cursor and maybe we should wait until we solve the problem for entities. But @matthew-white brought this up again today and I remembered that it is causing pain now. If there's something reasonable we can do in the short term, it would certainly provide value.
Was this explored?
We are seeing cases where filtering OData does not exclude all the submissions it's expected to. @lognaturel observed this while using the gt operator. I think this is because Postgres timestamp columns are more precise than millisecond. This also came up in #358 (which led to b15086d).
This comes up when fetching any submissions created since the latest known submission. One way to handle this is to ignore duplicate submissions. If that's not possible, 1ms could be added to the latest timestamp.
We could attempt to fix this in the code, probably by truncating any timestamp column. We could truncate a timestamp column in queries in which it is used to filter, though we would probably have to add an index for that. We could also consider truncating the timestamp when it is stored. It looks like the precision of Postgres timestamp columns can be configured (as part of the data type of the column: docs).
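A sketch of the "truncate when stored" option, with hypothetical table and column names: Postgres lets the fractional-second precision be declared as part of the column type, and note that existing values are rounded, not truncated, when the type is changed.

```js
const { sql } = require('slonik');

// timestamptz(3) keeps exactly millisecond precision, matching what Node's
// Date and the API can represent.
const reducePrecision = sql`
  ALTER TABLE submissions
    ALTER COLUMN "createdAt" TYPE timestamptz(3)`;
```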
Alternatively, perhaps we don't change the code and simply document the issue in the API docs, along with the different strategies to handle it.