cached augmented diffs differ from on-demand adiffs #346
Can you try a cross-check with http://dev.overpass-api.de/api_mmd/augmented_diff_status - it doesn't have caching enabled yet, i.e. it might be easier for comparison. Also debug=true still works: http://dev.overpass-api.de/api_mmd/augmented_diff?debug=true&id=2230741
yeah, I just tried it with adiff 2230746 (which corresponds to the time between …)
Looking at the actual (OSM) minutely-diffs, it seems like not all of the updates for each minute are contained in the same diff: 220430 contains data until … Is that to be expected? (e.g. does this happen with slowly uploading changesets maybe?) //cc @tomhughes How could Overpass circumvent this issue? Waiting a few minutes before actually publishing an augmented diff maybe? (But then, how long would such a buffer period have to be?)
You can't assume time ordering of the diffs, because changes only appear in a diff once the relevant database transaction has committed. A long transaction may start before a short one but finish after it, meaning the changes from the short one appear in an earlier diff.
@tomhughes: I'm wondering whether a transaction's creation timestamp is set right when the transaction starts, rather than immediately before the commit? Is there maybe some rule of thumb for how long such a transaction could take in the worst case? I assume that a transaction starts with "POST /api/0.6/changeset/#id/upload" and the commit is only triggered right after the call. If that's the case, a very slow upload might indeed introduce quite some delay. Not sure what the best way to deal with this issue would be.
You're confusing database transactions and OSM changesets. They are not the same thing. Creating a changeset is a transaction. Uploading a single object to it is a transaction. Uploading a diff to it is another transaction. Closing it (if done explicitly) is yet another transaction. The only thing you need to know is that you can't assume changes in the diffs will appear to be time ordered, because they won't be. I think most timestamps are assigned by Ruby, not by the database, so the timestamp for changeset creation will probably actually be set before the transaction is even opened, but will only be visible in the diffs once the transaction closes. Same for objects in a diff - each one will be assigned as it is read by Rails, I think, but check the source if you want to be sure.
Well, I thought I was talking about HTTP calls being atomic and representing one transaction each. So, yes, there may be multiple DB transactions associated with a single OSM changeset. I think the main issue here is that we have quite some semantic differences:
Indeed a very tricky question...
The only real solution I can currently think of would be for Overpass to store an additional meta field for each object version, namely the minutely diff number (or timestamp) in which it was published. Then, the augmented diff call could produce output in 1:1 correspondence to OSM's minutely diffs, i.e. one augmented diff for each OSM minutely diff, including the same objects. That would be incompatible with Overpass' current augmented diff semantics, though. @mmd-osm in #342 you mentioned that several clients are currently requesting augmented diffs. Do you know which applications those are? How did they not notice that a significant portion of the data is missing in the augmented diffs (if consumed immediately)?
Obsolete due to switch from osmosis to osmdbt. @tyrasd: Regarding the clients: there are some AWS clients without User Agent, and @pa5cal / @geonick are experimenting with it afaik. So I'm not sure which applications are really using it atm. If possible, changing the metadata for this use case should probably be avoided. Maybe we have an alternative: each minutely diff comes with a state.txt file, which according to osmosis has the following contents.
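For reference, a minutely state.txt is a Java-properties file along these lines (field names per osmosis; the values here are made-up examples):

```
#Sat Jul 15 07:41:03 UTC 2017
sequenceNumber=2516346
timestamp=2017-07-15T07\:41\:02Z
txnMaxQueried=902871787
txnMax=902871787
txnActiveList=902871625
txnReadyList=
```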
The important point here is that I was hoping we could leverage the information in txnActiveList and track how long transactions are running. The assumption would be:
@tomhughes: I didn't find much documentation on the state.txt semantics yet. Does such a …? BTW: I found the following description quite helpful (-> essential reading): http://wiki.openstreetmap.org/wiki/Osmosis/Replication#Time-aligned_versus_Transaction-aligned
Oh, cool. Nice idea! I didn't know about those fields.
Based on all minutely diff state files, I tried to calculate a distribution for the in-flight transaction lengths according to … (see the sketch below).
@brettch mentioned some timeframe of up to 24 hours on the osmosis wiki page, and I was a bit worried that this might seriously impact the delay we will face with augmented diffs. The good news is that a large amount of the transactions that are listed under …
Nevertheless, there are some extreme outliers, taking up to 8 days (!), like transaction number 260199054 below. Not sure what happened there... EDIT: Dec 2017: other activities like VACUUM and BACKUP also show up in this list, though they are not relevant. Transaction length in minutes / transaction number:
Downside of this approach: it only works with minutely diffs. Neither hourly nor daily diffs provide the respective list of active transactions.
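A sketch of how such in-flight durations could be derived from consecutive state files (my reconstruction, not the script actually used; `transaction_durations` is a hypothetical helper name):

```python
# Reconstruction sketch: estimate how long each transaction stays in
# txnActiveList by scanning consecutive minutely state.txt files in order.
import re
from datetime import datetime
from pathlib import Path

TS_RE = re.compile(r"^timestamp=(.*)$", re.M)
TXN_RE = re.compile(r"^txnActiveList=(.*)$", re.M)

def parse_state(text):
    # state.txt is a Java-properties file; colons in the timestamp are escaped
    ts_raw = TS_RE.search(text).group(1).replace("\\:", ":")
    ts = datetime.strptime(ts_raw, "%Y-%m-%dT%H:%M:%SZ")
    raw = TXN_RE.search(text).group(1).strip()
    return ts, set(filter(None, raw.split(",")))

def transaction_durations(state_files):
    """state_files: iterable of paths to *.state.txt, in sequence order."""
    first_seen, durations = {}, {}
    for path in state_files:
        ts, active = parse_state(Path(path).read_text())
        for txn in active:
            first_seen.setdefault(txn, ts)
            durations[txn] = (ts - first_seen[txn]).total_seconds() / 60.0
    return durations  # txn id -> minutes observed in txnActiveList
```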
Looking at those ids, it seems like they are all from the late 2012 period (somewhat shortly after the redaction period). These also seem to have happened in clumps: e.g. http://planet.osm.org/replication/minute/000/106/720.state.txt is the last state file that contains many of those long-"running" transactions with the 260* ids. The following minutely diff file (http://planet.osm.org/replication/minute/000/106/721.osc.gz) doesn't seem to contain any suspicious (e.g. >> 1 minute old) data at first glance.
AFAIK, hourly/daily diffs are just concatenated minutely diffs. So, the open transactions for an hourly diff would be the same as for the last minutely diff it was made from.
Right, I should have mentioned that those were just the most long-running ones. As the list is quite lengthy, I put it up on the dev server now: http://dev.overpass-api.de/tmp/psv/transactions.txt.gz - Another period with long-running transactions was in Feb 2016, where we had the minutely replication issue around 001/788/263.osc.gz. Transaction 774776540 was active back then for about 4 days.
The issue is that I don't know exactly how that process works. The wiki states that those files are being generated a few minutes after the hour and that they're ~~aggregating~~ concatenating minutely diffs.
Let's collect the valid and invalid assumptions:
This is a huge problem, because this assumption is deeply built into Overpass API - in particular the assumption that the timestamp of a diff means that all changes up to that date are included.
I don't think so. Some time ago the main API may have dropped intermediate versions of objects from hourly or daily diffs. Otherwise, there would be no gain of hourly and daily diffs over a tar file of minute diffs. To sum things up: a short-term solution could be to add a fixed delay of some minutes. The right number of minutes to lose very few changes should be figured out from mmd's statistics. @tomhughes Is the txnActiveList field (presence and precise semantics) part of the long-term interface? I suppose not, because that would take away useful degrees of freedom from the main DB server. As I assume that txnActiveList is not part of the long-term interface, I'm reluctant to build dependencies on it. However, I'm grateful for the statistics to understand what is happening right now. As a long-term solution, I prefer to revise the format of the changelog file in the Overpass API database and add a field for the file from which the change has come. That way we could augment whatever change file comes in.
Is that really so?? According to the counterexample mentioned recently on @dev, daily diffs do still contain intermediate versions of objects.
@tyrasd: that's right, I used the term "aggregated" above (which might be misleading), but it's really only some kind of concatenation. All intermediate versions are still present in hourly and daily diffs.
@brettch: please correct me if I'm wrong. Even the replication logic in osmosis is pretty much tied to the PostgreSQL transaction snapshot feature, see https://github.com/openstreetmap/osmosis/blob/master/osmosis-apidb/src/main/java/org/openstreetmap/osmosis/apidb/v0_6/impl/TransactionDao.java#L42 and https://www.postgresql.org/docs/current/static/functions-info.html#FUNCTIONS-TXID-SNAPSHOT. In particular, the list of currently active transactions in state.txt comes directly from PostgreSQL's txid_current_snapshot() -> xip_list (= active txids at the time of the snapshot). This has been this way for several years now. Changing this process would have a large impact on osmosis as well. Note: also check the discussion in openstreetmap/operations#154
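To illustrate the mapping (my sketch, not osmosis code; it assumes psycopg2, a reachable API database, and that txnMax/txnActiveList correspond to the snapshot's xmax/xip_list):

```python
# Illustration only: how the fields in state.txt relate to PostgreSQL's
# txid_current_snapshot(), which renders as "xmin:xmax:xip_list".
import psycopg2

conn = psycopg2.connect("dbname=osm")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute("SELECT txid_current_snapshot()::text")
    snapshot = cur.fetchone()[0]       # e.g. "902871600:902871788:902871625"
    xmin, xmax, xip_list = snapshot.split(":")
    active = [t for t in xip_list.split(",") if t]
    # roughly what ends up in the state file:
    print(f"txnMax={xmax}")
    print(f"txnActiveList={','.join(active)}")
```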
Is the assumption correct that there is no missing/doubled data when one consumes only cached adiffs, i.e. when you consume adiffs within 60 minutes after their existence was announced by …? (The 60 minutes stem from …)
@mmd-osm, all your comments about how transaction processing works look right to me, but here is some additional context in case it's useful. To be honest, it's been many years since I wrote most of this and my memory is getting a bit hazy ;-)

Osmosis replication processing ignores dates entirely and just uses the …

The minute diffs are the only ones that hit the DB directly. The hour and day diffs just roll up the minute diffs into batches. All diffs are a complete history of all changes between transaction points in the database and may contain multiple changes for a single entity. In all cases they are sorted by nodes, ways, then relations, and in increasing id/version order (so hour and day diffs are NOT a simple concatenation of minute and hour diffs, they are re-sorted). Hour and day diffs don't serve any purpose other than reducing the overhead of catching up over a long time interval.

None of the diffs are exactly time aligned, they are transaction snapshot aligned. Minute diffs are simply scheduled to run once per minute and read all data that has landed since the last invocation. If the planet server goes down for a while they'll end up quite large and contain more than a minute of data. I think there's a maximum transaction limit per file in there somewhere to keep a lid on things. Hour diffs feed off them and put approximately 1 hour of diffs into a file, but each minute file is placed entirely into a single hour file and never split across two files.

Some consequences of all this: the data is aligned based on the timestamp of when the transaction was committed, not when the data was inserted. Long-running transactions may result in their data appearing in a later replication file than the entity timestamps would indicate. Any tool that makes assumptions about the maximum duration of a transaction is doomed to occasional data loss.

Early Osmosis replication used timestamps. It worked well in the early MySQL days, but was a disaster once the PostgreSQL DB and changesets were introduced. Every time I tried increasing the "lag interval" to try to ensure all transactions had landed, somebody invented a longer-running transaction.

The content of the state files is not really meant for client tool consumption, only for Osmosis itself. There wasn't a lot of thought put into defining a public API. You can use them as a source of diagnostic information (e.g. timestamp), but that's about it.
Hi, I have been working through this problem for a while now as part of a side project to create accurate diffs and before/after states. The problem being, as described above, that the timestamp of an object doesn't necessarily correspond to when it appeared in the database and the minutely diffs. A lot of the minutely diffs have data from before the minute (usually a handful of seconds only) and even after, since the generation doesn't seem to start until 2-3ish seconds after the minute. The solution I've come up with is to compute a "committed at" time for each element. This is the estimated time the element's upload transaction completed. The "augmented diff" becomes the planet minutely diff plus stuff that was committed before anything in that diff. So it's keyed off the minutely diff id, and not absolute time. I compute the "committed at" time using the minutely diffs: an object's committed-at time is the max of the timestamps of objects with matching changeset ids in the minutely diff. I think this works, since the minutely diffs are a set of finished transaction data. It's not exactly correct, since there could be two uploads for a given changeset in a given minute, but the short answer is I think it'll still work. :) For elements added before minutely diffs started I do some estimates, but those can't be completely correct. I've been working on this as part of my own stack. I noticed the issues with overpass and didn't want to propose such a big change, because I feel like most of the use cases for overpass wouldn't justify it. There are many uses for being able to compute a true before/after state for any time and any area that I'm interested in exploring, but unfortunately I don't have anything to demo quite yet.
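A minimal sketch of that committed-at estimate (my reconstruction under the stated rule - per changeset, take the max element timestamp seen in the minutely diff; the function name is hypothetical):

```python
# Estimate a "committed at" time per element in one minutely .osc.gz diff:
# the max timestamp among elements sharing the same changeset id.
import gzip
import xml.etree.ElementTree as ET
from collections import defaultdict

def committed_at_times(osc_path):
    """Map (type, id, version) -> estimated commit time for one minutely diff."""
    with gzip.open(osc_path) as f:
        root = ET.parse(f).getroot()

    elements = []
    latest = defaultdict(str)  # changeset -> max timestamp (ISO strings sort lexically)

    for action in root:        # <create> / <modify> / <delete>
        for el in action:      # <node> / <way> / <relation>
            cs, ts = el.get("changeset"), el.get("timestamp")
            elements.append((el.tag, el.get("id"), el.get("version"), cs))
            latest[cs] = max(latest[cs], ts)

    return {(t, i, v): latest[cs] for t, i, v, cs in elements}
```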
In http://www.openstreetmap.org/user/geohacker/diary/40846, it sounds like @geohacker found a workaround for this issue:
@geohacker: How do you do this exactly?
@tyrasd I was going to post here, this is a thorough hack but works alright.
This relies entirely on the fact that Overpass adiff states are consistent and can be updated even if a feature arrives later in a minutely replication. And also that we host an Overpass instance entirely to do this, so it's a bit more reliable.
Oh, and the augmented diffs are also available for download: https://s3-ap-northeast-1.amazonaws.com/overpass-db-ap-northeast-1/augmented-diffs/ The state of the latest augmented diff is in a file called latest, like https://s3-ap-northeast-1.amazonaws.com/overpass-db-ap-northeast-1/augmented-diffs/latest. You can request an augmented diff this way: https://s3-ap-northeast-1.amazonaws.com/overpass-db-ap-northeast-1/augmented-diffs/2409184.osc Hope this helps and also reduces load on the Overpass instance!
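A minimal consumer sketch (assuming the latest file contains just the numeric sequence id, which is not spelled out above):

```python
# Fetch the newest mirrored augmented diff from the S3 bucket described above.
import urllib.request

BASE = "https://s3-ap-northeast-1.amazonaws.com/overpass-db-ap-northeast-1/augmented-diffs"

with urllib.request.urlopen(f"{BASE}/latest") as resp:
    latest_id = int(resp.read().strip())   # assumed: file holds the id only

with urllib.request.urlopen(f"{BASE}/{latest_id}.osc") as resp:
    osc_xml = resp.read()

print(f"fetched augmented diff {latest_id}: {len(osc_xml)} bytes")
```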
@geohacker: As mentioned before in the blog comment, there's one drawback you need to be aware of: most data consumers would assume an augmented diff not to change once it has been published. For your use case, that's perfectly ok, because users will typically fetch data at a much later point in time for changeset analysis purposes. By the time users look at the augmented diffs, they should be ok in 99.999+% of cases. In other usage scenarios (like updating stats based on augmented diffs, something @tyrasd is working on), you might get screwed by those "behind the scenes" updates of already published augmented diffs. That's a bit how Overpass behaves right now: if you call the underlying query powering the augmented_diff call multiple times across a number of minutes/hours, you might get different results. The caching mechanism for augmented diffs hides that effect a bit, but doesn't really change the overall issue. I think I mentioned somewhere in this issue that we could in theory delay the publication of augmented diffs until we're sure that no further data for a given timeframe will arrive. Some data consumers might find that better. On the downside, that could mean that we have to delay publication for minutes, hours and sometimes even days. Anyway, in the long run, Roland needs to come up with some magic to include replicate_id information in the database as well. And we'd probably even need much more magic to address the semantic gap between minutely diffs and timestamps. Fun fact: 3 years ago, persisting augmented diffs was given up in favor of on-the-fly generation. Quoting from the 0.7.50 version release notes:
Obsolete due to switch from osmosis to osmdbt. I implemented a very lightweight approach in https://github.com/mmd-osm/Overpass-API/wiki/Settings-for-0.7.58mmd-branch#other-files, which basically checks for an empty txnActiveList and writes an additional status file called …. The assumption here is that an empty txnActiveList implies that all pending transactions have been completed, and that we should have all data for the given timestamp available in the Overpass DB. As long as there are any transaction numbers in this state file, we have to assume that there are still some in-flight transactions, and it's better to wait (see the sketch below). https://github.com/mmd-osm/Overpass-API/wiki/Settings-for-0.7.58mmd-branch#rules_delta_completedsh provides an example of how to control area updates and make sure that we always have the complete data when updating areas. Previously, some objects might not have been processed, for the same reasons as with the current augmented diffs. This approach could of course also be used for augmented diffs, by delaying the creation of new augmented diffs until txnActiveList becomes empty. There may be some delays once in a while; still, it's the easiest way as of today to ensure augmented diffs are complete. In addition, exactly zero changes are needed to the C++ backend, which is a big plus. Implementation-wise it's fairly straightforward as well. Updated analysis: openstreetmap/openstreetmap-website#1710 (comment)
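A sketch of that gating idea (mine, not the linked implementation; the state URL and poll interval are placeholders):

```python
# Only proceed once txnActiveList in the replication state file is empty,
# i.e. no transaction that could still deliver data for this timestamp is open.
import time
import urllib.request

STATE_URL = "https://planet.osm.org/replication/minute/state.txt"  # placeholder

def txn_active_list(state_text):
    for line in state_text.splitlines():
        if line.startswith("txnActiveList="):
            value = line.split("=", 1)[1].strip()
            return [t for t in value.split(",") if t]
    return []

def wait_until_no_open_transactions(poll_seconds=30):
    while True:
        with urllib.request.urlopen(STATE_URL) as resp:
            state = resp.read().decode()
        if not txn_active_list(state):
            return  # safe to publish the adiff / run area updates
        time.sleep(poll_seconds)  # in-flight transactions remain; wait
```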
Yesterday I was downloading some "live" augmented diffs (that are now served from a cache, see #342), but got one empty file (2229363.osc). After downloading the same (now non-cached) adiff from http://overpass-api.de/api/augmented_diff?id=2229363, I do actually get some data. Looking around the downloaded files a bit more and comparing them to their currently generated counterparts, I see that most (all?) of them actually don't match up with each other (the current adiffs contain more data than the disk-cached adiffs). Here's an example: https://gist.github.com/tyrasd/fe213e1d17fc95c3a43b669b1ef154e1