
Add binlogs to backup providers #3581

Open
derekperkins opened this issue Jan 23, 2018 · 16 comments

@derekperkins
Member

Once Vitess is aware of binlogs, we can potentially expand their usage for other operations.

@bbeaudreault
Contributor

cc @acharis @hmcgonig we back up binlogs periodically outside of vitess, but it would be nice to move to this when it's ready, as it could streamline both backups and recovery.

@guidoiaquinti
Member

guidoiaquinti commented Feb 16, 2018

This functionality can be the first step in significantly speeding up tablet provisioning. We could also provision and sync a new replica without the master being involved in the operation (saving a good amount of resources).

Example:

  1. new replica is provisioned
  2. replica downloads & restores the last available backup for the shard
  3. replica downloads & applies all the available binlogs
  4. replica connects to any up-to-date replica in the pool to catch up with replication. The replica will download and apply everything committed between the timestamp of the last downloaded binlog and time.now()
  5. replica connects to the master tablet, catching up with replication (this should be a quick operation if the replica at point 4 is not lagged)
  6. replica is ready and synced

Step 4 is technically not necessary but simply a nice-to-have. I see its value only if binlogs are uploaded very infrequently and/or are very large.
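
A rough Go sketch of that sequencing, just to make the steps concrete (every type and helper below is a hypothetical placeholder, not an existing Vitess API):

```go
package provision

type backup struct{ Position string }

// Hypothetical placeholders for the corresponding Vitess operations.
func latestBackup(shard string) backup        { return backup{} }
func restore(b backup) error                  { return nil }
func binlogsSince(shard, pos string) []string { return nil }
func applyBinlog(file string) error           { return nil }
func pickReplica(shard string) string         { return "" }
func masterTablet(shard string) string        { return "" }
func replicateFrom(tablet string) error       { return nil }

func provisionReplica(shard string) error {
	b := latestBackup(shard) // step 2: last available backup for the shard
	if err := restore(b); err != nil {
		return err
	}
	for _, f := range binlogsSince(shard, b.Position) { // step 3
		if err := applyBinlog(f); err != nil {
			return err
		}
	}
	// Step 4: catch up off-master first, so the master only serves the tail.
	if err := replicateFrom(pickReplica(shard)); err != nil {
		return err
	}
	// Step 5: final, short catch-up against the master.
	return replicateFrom(masterTablet(shard))
}
```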

Regarding the binlog backup implementation, what do you think about:

  • using a lock per shard. The tablet that holds the lock is the owner of the binlog backup operation
  • the tablet queries MySQL to get the binlog filename every X seconds. If the filename changes, we can safely upload the old file (as MySQL has rotated it). A rough sketch of that polling loop is below.
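
Something like this could work; `uploadToBackupStorage` is a hypothetical stand-in for whatever backup plugin (file, GCS, S3, ...) the tablet is configured with:

```go
package binlogwatch

import (
	"database/sql"
	"log"
	"path/filepath"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

// currentBinlogFile returns the binlog file MySQL is currently writing.
func currentBinlogFile(db *sql.DB) (string, error) {
	rows, err := db.Query("SHOW MASTER STATUS")
	if err != nil {
		return "", err
	}
	defer rows.Close()
	if !rows.Next() {
		return "", rows.Err()
	}
	cols, err := rows.Columns()
	if err != nil {
		return "", err
	}
	// Only the first column (File) matters; the column count varies
	// across MySQL versions, so scan the rest into throwaways.
	dest := make([]interface{}, len(cols))
	var file string
	dest[0] = &file
	for i := 1; i < len(dest); i++ {
		dest[i] = new(sql.RawBytes)
	}
	return file, rows.Scan(dest...)
}

// watchRotation polls every interval; when the filename changes, the
// previous file has been rotated, is immutable, and is safe to upload.
// (A real implementation would also list SHOW BINARY LOGS to catch
// multiple rotations between two polls.)
func watchRotation(db *sql.DB, binlogDir string, interval time.Duration) {
	var last string
	for range time.Tick(interval) {
		cur, err := currentBinlogFile(db)
		if err != nil || cur == last {
			continue
		}
		if last != "" {
			if err := uploadToBackupStorage(filepath.Join(binlogDir, last)); err != nil {
				log.Printf("upload of %s failed, will retry: %v", last, err)
				continue
			}
		}
		last = cur
	}
}

func uploadToBackupStorage(path string) error { return nil } // hypothetical
```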

@derekperkins
Member Author

> replica downloads & applies all the available binlogs

I think you'd want some way to filter those binlogs based on GTID or something. Maybe we edit the binlog filename to include beginning/ending GTID to filter on. Another option would be to store a GTID->filename mapping alongside the binlog lock.
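
One possible shape for that mapping, stored next to the binlog-backup lock (the types and the `gtidSubset` helper are illustrative, not existing Vitess code):

```go
package manifest

type binlogManifestEntry struct {
	Filename  string `json:"filename"`   // e.g. "vt-0000000101-bin.000042"
	FirstGTID string `json:"first_gtid"` // first transaction in the file
	LastGTID  string `json:"last_gtid"`  // last transaction in the file
}

// filesToApply picks the backed-up binlogs a restoring replica still
// needs, given the GTID set its backup was taken at.
func filesToApply(manifest []binlogManifestEntry, restoredGTIDSet string) []string {
	var needed []string
	for _, e := range manifest {
		// gtidSubset answers "is this GTID already contained in the
		// restored set?"; Vitess's existing GTID set types could back it.
		if !gtidSubset(e.LastGTID, restoredGTIDSet) {
			needed = append(needed, e.Filename)
		}
	}
	return needed
}

func gtidSubset(gtid, set string) bool { return false } // hypothetical
```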

> using a lock per shard. The tablet that holds the lock is the owner of the binlog backup operation

Sounds good to me. We'd need to hook into reparenting logic to make sure that the binlog replica hasn't been promoted to master.

> the tablet queries MySQL to get the binlog filename every X seconds

Also seems logical. What about retention? Do we delete the local binlog immediately on successful upload?

@guidoiaquinti
Member

> I think you'd want some way to filter those binlogs based on GTID or something. Maybe we edit the binlog filename to include beginning/ending GTID to filter on. Another option would be to store a GTID->filename mapping alongside the binlog lock.

Yeah, we definitely need some logic to download only the data that we need. We shouldn't be worried about the "apply", as MySQL will automatically skip any transaction whose GTID is already in its gtid_executed set.

> Sounds good to me. We'd need to hook into reparenting logic to make sure that the binlog replica hasn't been promoted to master.

I'm not sure I'm following. Why do we need this?

> Also seems logical. What about retention? Do we delete the local binlog immediately on successful upload?

I don't think we should do that as we need the binlogs to allow other replicas to catch up. I think we should leave the file rotation/retention to MySQL.

@derekperkins
Member Author

> Sounds good to me. We'd need to hook into reparenting logic to make sure that the binlog replica hasn't been promoted to master.

> I'm not sure I'm following. Why do we need this?

It wouldn't necessarily break anything, but I don't think you want your master also doing your binlog backups.

@guidoiaquinti
Member

Yes, we are on the same page: during reparenting, the replica-now-new-master should release the lock.

@bbeaudreault
Contributor

hey guys, thanks for creating this. This issue seems to focus mostly on using binlogs to speed up replica provisioning. If we are to build a feature, it should also be useful for Disaster Recovery, which involves additional requirements:

  • tunable upload interval. You mentioned wanting a long interval so as to optimize provisioning (which makes sense). However, in the event of a disaster, that limits how close to the disaster you can recover. Our non-vitess backups run on a 5-minute interval, which allows us to recover to within 5 minutes of a disaster (assuming total loss of the master and other slaves).

  • Filtering by GTID. @derekperkins mentioned this, but @guidoiaquinti took it to mean skipping already-processed GTIDs. There is another function here -- allowing the DBA to skip bad transactions. This is for user-created disasters such as bad DMLs or DDLs.

  • Time-based filtering. This plays on the above, in that it would be useful to be able to say "Replay up to this end time, so I can take manual action from there". (Both filtering modes are sketched after this list.)

  • Option to skip re-enabling replication. This is useful for the manual recovery phase, where you don't want the replica to auto-join the master.

  • Improvements to the vtctl RestoreFromBackup command. Currently it always takes the latest backup. We should be able to choose a specific backup, along with the arguments required to make the above work.

There may be other improvements as well, but I consider these to be the bare minimum (in addition to the spec outlined in past comments).
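
For the two filtering items, the stock mysqlbinlog binary already implements both via its --exclude-gtids and --stop-datetime flags; here is a sketch of the plumbing (everything apart from those flags is hypothetical):

```go
package restore

import "os/exec"

// applyBinlogs replays files up to stopTime ("2018-02-16 12:00:00"
// format), dropping the GTID sets in badGTIDs (e.g. a bad DML the DBA
// wants to skip). Note it does not re-enable replication afterwards.
func applyBinlogs(files []string, stopTime, badGTIDs, socket string) error {
	args := []string{"--stop-datetime=" + stopTime}
	if badGTIDs != "" {
		args = append(args, "--exclude-gtids="+badGTIDs)
	}
	args = append(args, files...)

	decode := exec.Command("mysqlbinlog", args...)
	apply := exec.Command("mysql", "--socket", socket)

	pipe, err := decode.StdoutPipe()
	if err != nil {
		return err
	}
	apply.Stdin = pipe

	if err := apply.Start(); err != nil {
		return err
	}
	if err := decode.Run(); err != nil {
		return err
	}
	return apply.Wait()
}
```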

@bbeaudreault
Contributor

A future improvement to the above could allow replicas to serve different purposes: one uploads at a 5-minute interval for DR, and another uploads hourly chunks for speedy provisioning. This would be useful for those who can afford the extra storage. For now I think we can do without, though.

@guidoiaquinti
Member

guidoiaquinti commented Feb 16, 2018

@bbeaudreault all ✅

Minor point regarding "tunable upload interval": I don't see a case where we shouldn't upload a binlog file right after it has been rotated by MySQL. Do you also want to upload the binlog file that MySQL is still writing?

Looping in @ameetkotian, @demmer, and @rafael as well, since we were all discussing this functionality yesterday.

@bbeaudreault
Contributor

@guidoiaquinti actually yea, that works. I missed that part and got hung up on the second sentence here:

> Step 4 is technically not necessary but simply a nice-to-have. I see its value only if binlogs are uploaded very infrequently and/or are very large.

But to your point, this could be configurable by setting the max binlog size appropriately in the mysql configs, rather than as a tablet config.

so 👍 from me

@bbeaudreault
Contributor

bbeaudreault commented Feb 16, 2018

Oh, I realized -- what we do is force a particular interval by using a cron job that first calls FLUSH BINARY LOGS and then uploads any new files. That way we can ensure that we're uploading at a particular interval. That interval is what would be nice to have tunable.

This way, regardless of write volume, you can ensure that you're always recoverable to within X minutes.
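
A minimal sketch of that forced-rotation loop, assuming the rotation watcher described earlier handles the actual upload:

```go
package flush

import (
	"database/sql"
	"log"
	"time"
)

// flushLoop forces a binlog rotation every interval, so a freshly
// closed file is always available to upload and the DR window is
// bounded by the interval regardless of write volume.
func flushLoop(db *sql.DB, interval time.Duration) {
	for range time.Tick(interval) {
		if _, err := db.Exec("FLUSH BINARY LOGS"); err != nil {
			log.Printf("FLUSH BINARY LOGS failed: %v", err)
		}
	}
}
```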

I suppose we could use a separate process to do this and just let vitess notice the new log and upload it. It would be nice to have it built in, but I guess that's not necessary if we take that approach.

@guidoiaquinti
Member

> Oh, I realized -- what we do is force a particular interval by using a cron job that first calls FLUSH BINARY LOGS and then uploads any new files. That way we can ensure that we're uploading at a particular interval. That interval is what would be nice to have tunable.

Good point, this is probably the best way to enforce a valid DR policy (e.g. we can lose at most X minutes of binlogs).

@alainjobart
Contributor

I could talk for hours about this subject, and all the ramifications working on some of these features would have... Here are some related thoughts:

A. High replica count use cases: In setups where there are many replicas per shard (tens of them), a simple star-shaped replication scheme gets very limiting:

  • a large number of slaves adds extra load on the master.
  • multiple slaves in remote regions all use expensive bandwidth to replicate the same data stream from the master.
  • when reparenting, you have to reconnect all slaves to the new master. It gets slow, as you may have to wait for a large number of slaves before re-enabling writes.
  • when restoring a backup, you have to catch up using the master (as described here), possibly from a day or two ago.

Internally, we developed a tool very similar to a 'binlog server': it's basically a mysql replication node that only knows how to replicate binlogs, stores them locally, and can serve them to other slaves. If such nodes are inserted between the master and the replicas, it gives the user a much better replication profile:

  • only the binlog servers connect to the master, limiting the load on the master. If they have somewhat persistent storage, they're never too far behind.
  • if you put a couple binlog servers in remote regions, that's enough to feed all the region replicas.
  • after restoring a backup, all the data comes from the closest binlog server, never from the master.
  • the replicas can be configured to connect to any binlog server locally, they don't need to know 'the master'.

But this tool only makes sense with a somewhat high replica count, which may not be a very common use case. However...

B. Binlog backups: binlogs have a very cool property: they're append-only streams. That's a very different pattern from backups (big one-time dump) or data files (random access all the time). It makes them very similar to log files. Some cloud providers have very cool, cheap storage for append-only files.

Back to the 'binlog server': you could have one instance of the binlog server backed by one of these storage systems. As it gets binlogs from the master, they're added to the cloud storage, making it a backup with very little latency (write binlogs every second, or whenever they exceed 128k, for instance).
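
That flush policy is easy to sketch: buffer incoming events and flush on whichever threshold trips first. `appendToCloud` below is a hypothetical stand-in for the provider's append API:

```go
package uploader

import (
	"bytes"
	"time"
)

type binlogUploader struct {
	buf       bytes.Buffer
	lastFlush time.Time
}

// write buffers one binlog event and flushes once a second has passed
// or the buffer exceeds 128 KiB, whichever comes first.
func (u *binlogUploader) write(event []byte) error {
	u.buf.Write(event)
	if u.buf.Len() >= 128<<10 || time.Since(u.lastFlush) >= time.Second {
		return u.flush()
	}
	return nil
}

func (u *binlogUploader) flush() error {
	if u.buf.Len() == 0 {
		return nil
	}
	if err := appendToCloud(u.buf.Bytes()); err != nil {
		return err // keep the buffer; retry on the next write
	}
	u.buf.Reset()
	u.lastFlush = time.Now()
	return nil
}

// appendToCloud is hypothetical; a real implementation would also
// record the replication position of the last event flushed.
func appendToCloud(chunk []byte) error { return nil }
```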

Then you may not even need binlog streams to stand up more 'binlog servers', they could just read the cloud storage generated by the main guy (and use master election to know which one owns writing to the cloud storage... good thing we have master election code in our topo API!).

Also, when taking a backup, you store the exact position of the backup in the replication stream, so when you bring up a replica, it's very easy to find the replication starting point.

With both backups and filtered replication streams stored securely in your distributed cloud storage, it also becomes very easy to stand up a replica at exactly a replication point. It's then very easy to access data at any point in time, even from days or weeks ago, depending on the retention policy of the backups and binlogs. A very cool feature to get out of a jam when a bad application change wiped out critical data.

C. Different binlog strategies: to save local processing on the replicas, only the replicas that can be elected as masters should save their binlogs. The others should not, as they don't need to. Most setups now enable local binlogs on all replicas.

In Vitess, filtered replication connects to replicas, but it could obviously also connect to the binlog servers. Filtering is done on the server side for these, so they'd need to know the vschema, which is easy enough.

D. Pre-filtered binlog streams: when we split shards, we have to split the replication stream on the fly. The destination split shards get a subset of the replication stream. But what if the replication stream were already split, by default on every shard, into multiple streams, each for a smaller keyrange? Then when splitting a shard, you can just subscribe to a smaller number of streams. The 'binlog servers' could pre-split the streams by keyrange as they store them in the cloud storage (a sketch of this follows below).

We could also pre-split the backups, but that's harder, as we just save the entire data files for backup.
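
A sketch of that keyrange pre-splitting; `keyspaceIDOf` is hypothetical, and in practice would apply the vschema's primary vindex to the event's sharding column:

```go
package splitter

import "bytes"

// keyRange is [start, end), with an empty end meaning "unbounded".
type keyRange struct{ start, end []byte }

func (kr keyRange) contains(id []byte) bool {
	return bytes.Compare(id, kr.start) >= 0 &&
		(len(kr.end) == 0 || bytes.Compare(id, kr.end) < 0)
}

// route appends an event to every sub-stream whose keyrange contains
// the event's keyspace ID; a destination shard then subscribes only to
// the sub-streams overlapping its own keyrange.
func route(event []byte, streams map[string]keyRange, sink func(name string, ev []byte)) {
	id := keyspaceIDOf(event)
	for name, kr := range streams {
		if kr.contains(id) {
			sink(name, event)
		}
	}
}

func keyspaceIDOf(event []byte) []byte { return nil } // hypothetical
```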

E. Reversing how we do filtered replication: somewhat related to this topic. The way we do filtered replication right now is a bit convoluted: the destination shard master vttablet connects to a source shard replica, gets a replication stream from the binlog, turns the events into SQL, and re-plays the SQL locally.

An alternate solution would be to give the destination shard master a MySQL replication master that knows how to serve the filtered binlog replication stream. Serving this from a binlog server would be somewhat easy. And we'd let MySQL replication do the work of remembering where we left off (now that it supports multiple sources, we can do that for merges too, not just splits).
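
On the merge case, MySQL 5.7 multi-source replication makes the destination side almost pure configuration; a sketch, with illustrative host names supplied by the caller:

```go
package merge

import (
	"database/sql"
	"fmt"
)

// configureMergeChannels points one replication channel at each
// source-shard binlog server, each serving a pre-filtered stream.
func configureMergeChannels(db *sql.DB, sources []string) error {
	for i, host := range sources {
		channel := fmt.Sprintf("source%d", i)
		if _, err := db.Exec(fmt.Sprintf(
			"CHANGE MASTER TO MASTER_HOST='%s', MASTER_AUTO_POSITION=1 FOR CHANNEL '%s'",
			host, channel)); err != nil {
			return err
		}
		if _, err := db.Exec(fmt.Sprintf(
			"START SLAVE FOR CHANNEL '%s'", channel)); err != nil {
			return err
		}
	}
	return nil
}
```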

Wow, that's a lot of cool stuff we could do here... and I'm not even going into Sugu's plan of using semi-sync to control master commits, which would allow multiple masters and Paxos master election... You can tell Sugu and I have been thinking about it a lot hehe

@bbeaudreault
Contributor

Great write-up @alainjobart! Knowledge... dropped.

It does seem like vitess is well positioned to provide a new binlog server binary. We know the topology and we have the lock service, two big components for such a feature. This does seem like a larger project than periodic uploads, but it's both higher fidelity and more scalable. One could recover to within seconds or less, and scale reads by just adding more readers to the cloud storage. Would be really sweet.

@demmer
Member

demmer commented Feb 16, 2018

Couldn't agree more. This sounds like a hugely valuable addition to the Vitess portfolio of features.

@derekperkins
Member Author

derekperkins commented Feb 16, 2018

This has a lot of interesting ramifications about what Vitess is, which is obviously important to @sougou and @jvaidya now that they're building a business. This is one of the growing list of features that is indifferent to sharding. This is an opinionated way to operate MySQL at any scale. Even if you were a startup with 100 MB of data, Vitess becomes a compelling solution that integrates and automates backups, failover and monitoring. In the same way that Kubernetes "won" the orchestration wars, features like this position Vitess to "win" MySQL deployments. Why would you ever choose to run vanilla MySQL if/when it's just as easy to spin up Vitess?
