-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Take snapshots less frequently #3367
Conversation
…dy applied dataset. Allow user to specify how often to take snapshots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r1, 1 of 1 files at r2.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @manishrjain)
dgraph/cmd/alpha/run.go, line 110 at r1 (raw file):
" Also determines how often Rollups would happen.") flag.String("abort_older_than", "5m", "Abort any pending transactions older than this duration. The liveness of transaction"+
"liveness of a transaction"
dgraph/cmd/alpha/run.go, line 111 at r1 (raw file):
flag.String("abort_older_than", "5m", "Abort any pending transactions older than this duration. The liveness of transaction"+ " is determined based on the last mutation ran by the transaction.")
"is based on"
dgraph/cmd/alpha/run.go, line 490 at r1 (raw file):
StrictMutations: opts.MutationsMode == edgraph.StrictMutations, AclEnabled: secretFile != "", SnapshotAfter: Alpha.Conf.GetInt("snapshot_after"),
What happens if this is a negative int or zero?
worker/draft.go, line 774 at r1 (raw file):
n.elog.Printf("Size of applyCh: %d", len(n.applyCh)) if err := n.updateRaftProgress(); err != nil { glog.Errorf("While updating Raft progress: %v", err)
What is this glog for?
In the cluster logs I see this:
I0503 20:46:08.800737 1 groups.go:410] Serving tablet for: counter.val
I0503 20:46:08.901648 1 mutation.go:149] Done schema update predicate:"_predicate_" value_type:STRING list:true
I0503 20:46:09.000878 1 mutation.go:149] Done schema update predicate:"dgraph.type" value_type:STRING directive:INDEX tokenizer:"exact" list:true
E0503 20:46:33.735255 1 draft.go:774] While updating Raft progress: Key not found
I0503 20:47:03.737546 1 draft.go:1214] Num pending txns: 1
E0503 20:47:03.739728 1 draft.go:774] While updating Raft progress: Key not found
E0503 20:47:33.744220 1 draft.go:774] While updating Raft progress: Key not found
I0503 20:48:03.740293 1 draft.go:1214] Num pending txns: 1
E0503 20:48:03.743324 1 draft.go:774] While updating Raft progress: Key not found
I don't know what this means or what to do about "Key not found" error logs, which are appearing a lot.d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 5 files reviewed, 6 unresolved discussions (waiting on @danielmai and @manishrjain)
dgraph/cmd/alpha/run.go, line 490 at r1 (raw file):
Previously, danielmai (Daniel Mai) wrote…
What happens if this is a negative int or zero?
Pretty sure cobra lets you define flags as uints.
worker/draft.go, line 695 at r2 (raw file):
// this in Raft WAL. Instead, this is used to just skip over log records that this Alpha // has already applied, to speed up things on a restart. snap, err := n.calculateSnapshot(10)
Maybe comment on why 10 is chosen here.
x/keys.go, line 310 at r2 (raw file):
p.bytePrefix = key[0] if p.bytePrefix == byteRaft {
maybe check that the next bytes spell "raft" since you are doing that in RaftKey?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 6 files reviewed, 6 unresolved discussions (waiting on @danielmai, @manishrjain, and @martinmr)
dgraph/cmd/alpha/run.go, line 490 at r1 (raw file):
Previously, danielmai (Daniel Mai) wrote…
What happens if this is a negative int or zero?
It would just keep on doing snapshots. Doubt we really need to check for that.
worker/draft.go, line 774 at r1 (raw file):
Previously, danielmai (Daniel Mai) wrote…
What is this glog for?
In the cluster logs I see this:
I0503 20:46:08.800737 1 groups.go:410] Serving tablet for: counter.val I0503 20:46:08.901648 1 mutation.go:149] Done schema update predicate:"_predicate_" value_type:STRING list:true I0503 20:46:09.000878 1 mutation.go:149] Done schema update predicate:"dgraph.type" value_type:STRING directive:INDEX tokenizer:"exact" list:true E0503 20:46:33.735255 1 draft.go:774] While updating Raft progress: Key not found I0503 20:47:03.737546 1 draft.go:1214] Num pending txns: 1 E0503 20:47:03.739728 1 draft.go:774] While updating Raft progress: Key not found E0503 20:47:33.744220 1 draft.go:774] While updating Raft progress: Key not found I0503 20:48:03.740293 1 draft.go:1214] Num pending txns: 1 E0503 20:48:03.743324 1 draft.go:774] While updating Raft progress: Key not found
I don't know what this means or what to do about "Key not found" error logs, which are appearing a lot.d
Done. Fixed it up.
x/keys.go, line 310 at r2 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
maybe check that the next bytes spell "raft" since you are doing that in RaftKey?
Probably not important for parsing right now. We just need to mostly skip this key.
Taking snapshots frequently causes a straggler follower to have a hard time getting a new snapshot streamed. If the latter takes more time than the former, then the straggler would never catch up. So, instead of taking snapshots frequently, this PR adds a mechanism to track the progress of Raft in the p directory. This way, a restarted Alpha does not need to replay all the Raft logs, only the ones which it hasn't applied yet. This PR adds two new flags: 1. The duration after which we'd abort a txn. 2. The number of Raft logs after which a snapshot would be taken. This PR also removes a strange lastCommitTs logic, which was only spitting out an log error, without doing anything. Changes: * Keep track of Raft progress in p directory, so it can skip over already applied dataset. Allow user to specify how often to take snapshots. * Improve how we set raft progress. * Make x.Parse parse the new Raft key * Handle key not found error. * Fix up the debug wal to spit out the last entry as well.
Taking snapshots frequently causes a straggler follower to have a hard time getting a new snapshot streamed. If the latter takes more time than the former, then the straggler would never catch up. So, instead of taking snapshots frequently, this PR adds a mechanism to track the progress of Raft in the p directory. This way, a restarted Alpha does not need to replay all the Raft logs, only the ones which it hasn't applied yet. This PR adds two new flags: 1. The duration after which we'd abort a txn. 2. The number of Raft logs after which a snapshot would be taken. This PR also removes a strange lastCommitTs logic, which was only spitting out an log error, without doing anything. Changes: * Keep track of Raft progress in p directory, so it can skip over already applied dataset. Allow user to specify how often to take snapshots. * Improve how we set raft progress. * Make x.Parse parse the new Raft key * Handle key not found error. * Fix up the debug wal to spit out the last entry as well.
Taking snapshots frequently causes a straggler follower to have a hard time getting a new snapshot streamed. If the latter takes more time than the former, then the straggler would never catch up.
So, instead of taking snapshots frequently, this PR adds a mechanism to track the progress of Raft in the p directory. This way, a restarted Alpha does not need to replay all the Raft logs, only the ones which it hasn't applied yet.
This PR adds two new flags:
This PR also removes a strange lastCommitTs logic, which was only spitting out an log error, without doing anything.
This change is