Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support expiring ops from oplog #33

Closed
wants to merge 8 commits into from

Conversation

devongovett
Copy link
Contributor

This goes along with share/sharedb-mongo#12. It handles emitting Missing operations errors when operations requested from the oplog are not there. This happens in submit and getOps (and by extension fetch, since it calls getOps).

This error should occur if a client has been offline longer than the ttl and tries to submit/fetch some ops. It should only occur if other edits have taken place while the user was offline, which have also expired. Specifically:

  1. user 1 goes offline
  2. user 1 edits
  3. user 2 edits
  4. ops are dropped (user 2’s ops expire)
  5. user 1 comes back online and submits its ops
  6. ERROR

If 3 and 4 are swapped (i.e. user 2 doesn’t edit anything before the ops are dropped), the error shouldn't occur. Basically, it should be fairly rare.

When the error happens, ShareJS sends it back to the client, which has an opportunity to notify the user that their changes could not be merged. Not ideal, but necessary to keep the data size in mongo from growing forever.

One case I thought of but I'm not sure about is whether we need to get the snapshot version in getOps when to is null to determine whether ops are expected from the oplog. Right now, the check for missing operations only occurs when there are ops returned (check ops[0].v === from). If to is null, we may also need to check if ops were expected if from < snapshot.v, which means fetching the snapshot there. I don't really like that, but it may be necessary. In submit this is handled already. What do you think?

@josephg
Copy link
Member

josephg commented Oct 19, 2014

Awesome - thanks for this. A lot of people have been very excited about this feature.

I'll review & merge in the next couple of days if I can.

@devongovett
Copy link
Contributor Author

Great, thanks. 😉

@devongovett
Copy link
Contributor Author

Not to be pushy, but is there any chance you could review this in the next few days? I'd like to get it into production asap since our db is growing really fast. Thanks a ton!

@josephg
Copy link
Member

josephg commented Oct 27, 2014

Meta: This is one of those cases where it'd be useful to have an easy pay-for-features system.

I'm currently (by choice) unemployed and I'm working on the new JSON type (which I'm finding quite enjoyable). I try to avoid context switching, especially when working on a hard problem.

I don't want to context switch to other problems purely out of social obligation. In the long run, that will make me feel resentful of the opensource work I do, and that doesn't benefit anyone. There are a few solutions which come to mind. The most obvious ones are things like gittip or having some people who I trust that can review code like this. Its something to think about going forward.

In this case the diff is actually quite small and I'll review it today.

@josephg
Copy link
Member

josephg commented Oct 27, 2014

The code looks good. Can you also add tests to see what happens if you call subscribe at an old version of the document. When that works, I'll accept the PR.

@devongovett
Copy link
Contributor Author

I get it. Where is the best place to send you tips? Gittip? We greatly appreciate the work you've put into ShareJS etc. over the years. It is a seriously awesome enabling technology.

I'll add the test for subscribing at an old version. Any thoughts about the last paragraph of the original issue? Not sure if we need to fetch the snapshot in getOps to check if ops are missing. Thanks!

@josephg
Copy link
Member

josephg commented Oct 27, 2014

Thanks - I appreciate you saying so :)

Yeah gittip is great. There's also a bitcoin address down the bottom of the gittip page.

I didn't notice that comment - I'll take another close readthrough of that code in a few hours. I remember thinking carefully about the logic in that function, and I can't remember all the cases off the top of my head. I agree that it'd be nice to avoid calling getSnapshot there - I'll have a think about it soon.

@josephg
Copy link
Member

josephg commented Oct 28, 2014

You're right - we straight out don't have enough information in lib/index.js's getOps function to do the right thing when to is null and some ops are missing.

Lets talk about the interfaces between parts. There's 3 interoperating parts in a getOps call:

  • lib/index.js getOps(), which you edited
  • The driver's getOps() function (the redis driver's function is quite complicated because it caches ops in redis, and only relies on the cache in most cases)
  • The oplog's getOps function (eg, livedb-mongo).

It seems like there's two possible APIs:

  1. Get all ops from to to. If any ops in the range are missing from the database, return an error.
  2. Get all ops from to to. If any ops in the range are missing from the database, return as many operations as we can.

Imagine the code consuming these operations. Imagine if I ask for ops 1-10 and you return ops 5-10. Lets talk through a few use cases:

  • In the submit function, partial results are useless. option 1 is slightly better.
  • In the client (doing catchup), partial results are useless. option 1 is slightly better.
  • If you want to implement playback (to scrub through history), you absolutely want the partial results. If you ask for 1-1000 and the database is missing operations 1 and 2, I don't want my request to fail - I still want 3-1000 or as many ops as you can give me. To make this work, the client would have to binary search the set of ops the server might have. Option 1 is basically unusable.

But nobody has implemented a playback control yet, as far as I know. For now I'm happy with simply returning an error if any elements of the range are missing.

Question: what is the API for the oplog and the driver? If the oplog's API is 'Best effort at returning as many ops as possible in the range', does the redis driver do the right thing in all cases? Is that the right API for each of those components?

I think we need to tighten the interface constraints. First, the oplog should not be best-effort. If I ask for ops from 10-20, it should return a continuous range of values (it should never return 10,11,13,...). If its missing some operations, it should either return a partial set to the end of the requested range, or it should error. We need to be clear about what happens here. Looking at the code, currently livedb-mongo could do literally anything here - it simply returns the set of all operations which are still in mongo. Question: How does mongo handle TTLs? Will it ever be possible for mongo to return operation 10 but not operation 11? (I bet it is.)

What API should the driver's getOps function expose? I bet the redis driver is currently buggy if the oplog returns a partial set of results. But if the oplog errors instead, the error should get propagated correctly. -> So lets just make livedb error if your range isn't in the database for now.


Separately there's the question you actually asked. What should we do if to is null and we don't find any ops at all. This is a super common case - it happens with every submitted operation. But the redis caching layer should make it not a big problem. Making this behave correctly is also important for subscribe to work properly, as you should discover when you add some more tests. Because subscribe calls the oplog's getOps function, this needs to be handled either in the driver or in the oplog (ie livedb-mongo). Putting it in livedb-mongo has the advantage that we're already adding the TTL code there anyway. I really like that the in-process driver's getOps function is 1 line long. If we add the logic to the driver, it'll be more complicated and it'll have to be in both drivers. So maybe do it there. (That code will need to be modified anyway so getOps handles missing range values correctly).

I suppose another way to do it is to have a getops wrapper function in livedb that calls oplog.getOps then checks the results. If the range is the wrong size, it errors. If from is null and the range is empty, it calls oplog.getVersion to check the version. However, there's a race condition. Imagine the document is version 10. You call getOps(10,null) which returns an empty list. At the same time, someone else submits an operation. You call getVersion to make sure that you aren't missing any operations and it says you're at version 11 - so you are missing an operation! In this case, you should retry your getOps call and make sure you get that operation you missed.

Sorry for the huge brain dump - I hope thats everything.

@devongovett
Copy link
Contributor Author

Wow, thanks for all the thought you put into this. Here's what I think we should do.

I think, for now at least, we should return an error when ops are missing. There should be a wrapper around getOps that lives in livedb (so it is driver/db agnostic), that checks the returned ops to make sure they are all there, and in order with no holes. If to is null, and no ops are returned, we call getVersion to check if ops were expected. If the version returned is greater than from, then either operations are missing, or an op was just submitted by another client. If we retry and don't get any ops, then ops are missing and we return an error. If we get ops back after a retry, then we check that there are none missing from the requested range and return them as normal if so. Does this make sense?

I think this race condition is already handled by submit since it retries if ops were submitted between the fetch call, and when the ops are actually submitted. submit also already handles the case where to is null since it fetches the snapshot (and by extension the version) first, and can check if ops are missing.

Requested ranges are checked against the returned ops now by livedb-mongo, so an error will be returned if ops are missing.
@devongovett
Copy link
Contributor Author

I decided that it was easier and made more sense to just handle this stuff from livedb-mongo since the driver calls that directly from subscribe, etc., and it means we only need to do these checks in one place. I basically did what you and I described above, so see share/sharedb-mongo#12 for the details.

In terms of this PR, there are only test changes that are required now (due to behavior change in getOps - emitting error when ops are missing vs emitting what we have). The other changes I made earlier could be backed out I guess since we handle it from the db layer. Do you want me to do that, or keep them around (even though they are not complete) to catch oddities in redis or other places?

@devongovett
Copy link
Contributor Author

Just merged master back into this branch. Any chance of reviewing this and share/sharedb-mongo#12 soon? FWIW, we've been running this in production at Storify for a few months now.

@nmai
Copy link

nmai commented Sep 27, 2015

@devongovett Without the ability to purge historical data, the official LiveDB is not a good option for me unless I had unlimited resources. Are you still using your expire-ops branch in production?

@devongovett
Copy link
Contributor Author

Yes, still running this in production.

@nmai
Copy link

nmai commented Sep 27, 2015

Sweet, I'll start using your fork. It's a shame this feature never got merged.

@josephg
Copy link
Member

josephg commented Oct 11, 2015

Reviewing? I did review it. Its missing the things I said, and missing tests.

@devongovett
Copy link
Contributor Author

Uhh... I made changes since you reviewed and moved a bunch of stuff into the mongo backend instead of doing it here. There are a few tests here, but the real meat of the logic was moved into the mongo backend, so most of the tests are there.

@@ -172,7 +172,7 @@ RedisDriver.prototype.atomicSubmit = function(cName, docName, opData, options, c
// In this case, we should write a no-op ramp to the snapshot
// version, followed by a delete & a create to fill in the missing
// ops.
throw Error('Missing oplog for ' + cName + ' ' + docName);
return callback('Missing oplog for ' + cName + ' ' + docName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return callback('Missing oplog');

... so people can match on the error string.

@josephg
Copy link
Member

josephg commented Oct 11, 2015

Hah - I never took the money out of gittip and a week ago they quietly started refunding it :/

The code looks good. Can you also add tests to see what happens if you call subscribe at an old version of the document.

@nmai
Copy link

nmai commented Oct 11, 2015

If a document is deleted, do the ops get purged as well? Not sure if this is implemented in the standard behavior already.

@nateps nateps closed this Dec 30, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants