Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Take snapshots less frequently #3367

Merged
merged 9 commits into from
May 3, 2019
Merged

Take snapshots less frequently #3367

merged 9 commits into from
May 3, 2019

Conversation

manishrjain
Copy link
Contributor

@manishrjain manishrjain commented May 3, 2019

Taking snapshots frequently causes a straggler follower to have a hard time getting a new snapshot streamed. If the latter takes more time than the former, then the straggler would never catch up.

So, instead of taking snapshots frequently, this PR adds a mechanism to track the progress of Raft in the p directory. This way, a restarted Alpha does not need to replay all the Raft logs, only the ones which it hasn't applied yet.

This PR adds two new flags:

  1. The duration after which we'd abort a txn.
  2. The number of Raft logs after which a snapshot would be taken.

This PR also removes a strange lastCommitTs logic, which was only spitting out an log error, without doing anything.


This change is Reviewable

@manishrjain manishrjain requested a review from a team as a code owner May 3, 2019 19:29
Copy link
Contributor

@danielmai danielmai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1, 1 of 1 files at r2.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @manishrjain)


dgraph/cmd/alpha/run.go, line 110 at r1 (raw file):

			" Also determines how often Rollups would happen.")
	flag.String("abort_older_than", "5m",
		"Abort any pending transactions older than this duration. The liveness of transaction"+

"liveness of a transaction"


dgraph/cmd/alpha/run.go, line 111 at r1 (raw file):

	flag.String("abort_older_than", "5m",
		"Abort any pending transactions older than this duration. The liveness of transaction"+
			" is determined based on the last mutation ran by the transaction.")

"is based on"


dgraph/cmd/alpha/run.go, line 490 at r1 (raw file):

		StrictMutations:     opts.MutationsMode == edgraph.StrictMutations,
		AclEnabled:          secretFile != "",
		SnapshotAfter:       Alpha.Conf.GetInt("snapshot_after"),

What happens if this is a negative int or zero?


worker/draft.go, line 774 at r1 (raw file):

			n.elog.Printf("Size of applyCh: %d", len(n.applyCh))
			if err := n.updateRaftProgress(); err != nil {
				glog.Errorf("While updating Raft progress: %v", err)

What is this glog for?

In the cluster logs I see this:

I0503 20:46:08.800737       1 groups.go:410] Serving tablet for: counter.val
I0503 20:46:08.901648       1 mutation.go:149] Done schema update predicate:"_predicate_" value_type:STRING list:true 
I0503 20:46:09.000878       1 mutation.go:149] Done schema update predicate:"dgraph.type" value_type:STRING directive:INDEX tokenizer:"exact" list:true 
E0503 20:46:33.735255       1 draft.go:774] While updating Raft progress: Key not found
I0503 20:47:03.737546       1 draft.go:1214] Num pending txns: 1
E0503 20:47:03.739728       1 draft.go:774] While updating Raft progress: Key not found
E0503 20:47:33.744220       1 draft.go:774] While updating Raft progress: Key not found
I0503 20:48:03.740293       1 draft.go:1214] Num pending txns: 1
E0503 20:48:03.743324       1 draft.go:774] While updating Raft progress: Key not found

I don't know what this means or what to do about "Key not found" error logs, which are appearing a lot.d

Copy link
Contributor

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 5 files reviewed, 6 unresolved discussions (waiting on @danielmai and @manishrjain)


dgraph/cmd/alpha/run.go, line 490 at r1 (raw file):

Previously, danielmai (Daniel Mai) wrote…

What happens if this is a negative int or zero?

Pretty sure cobra lets you define flags as uints.


worker/draft.go, line 695 at r2 (raw file):

	// this in Raft WAL. Instead, this is used to just skip over log records that this Alpha
	// has already applied, to speed up things on a restart.
	snap, err := n.calculateSnapshot(10)

Maybe comment on why 10 is chosen here.


x/keys.go, line 310 at r2 (raw file):

	p.bytePrefix = key[0]
	if p.bytePrefix == byteRaft {

maybe check that the next bytes spell "raft" since you are doing that in RaftKey?

Copy link
Contributor Author

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 6 files reviewed, 6 unresolved discussions (waiting on @danielmai, @manishrjain, and @martinmr)


dgraph/cmd/alpha/run.go, line 490 at r1 (raw file):

Previously, danielmai (Daniel Mai) wrote…

What happens if this is a negative int or zero?

It would just keep on doing snapshots. Doubt we really need to check for that.


worker/draft.go, line 774 at r1 (raw file):

Previously, danielmai (Daniel Mai) wrote…

What is this glog for?

In the cluster logs I see this:

I0503 20:46:08.800737       1 groups.go:410] Serving tablet for: counter.val
I0503 20:46:08.901648       1 mutation.go:149] Done schema update predicate:"_predicate_" value_type:STRING list:true 
I0503 20:46:09.000878       1 mutation.go:149] Done schema update predicate:"dgraph.type" value_type:STRING directive:INDEX tokenizer:"exact" list:true 
E0503 20:46:33.735255       1 draft.go:774] While updating Raft progress: Key not found
I0503 20:47:03.737546       1 draft.go:1214] Num pending txns: 1
E0503 20:47:03.739728       1 draft.go:774] While updating Raft progress: Key not found
E0503 20:47:33.744220       1 draft.go:774] While updating Raft progress: Key not found
I0503 20:48:03.740293       1 draft.go:1214] Num pending txns: 1
E0503 20:48:03.743324       1 draft.go:774] While updating Raft progress: Key not found

I don't know what this means or what to do about "Key not found" error logs, which are appearing a lot.d

Done. Fixed it up.


x/keys.go, line 310 at r2 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

maybe check that the next bytes spell "raft" since you are doing that in RaftKey?

Probably not important for parsing right now. We just need to mostly skip this key.

@manishrjain manishrjain merged commit 7789ed9 into master May 3, 2019
@manishrjain manishrjain deleted the mrjn/snapshot-mojo branch May 3, 2019 21:24
manishrjain added a commit that referenced this pull request May 3, 2019
Taking snapshots frequently causes a straggler follower to have a hard time getting a new snapshot streamed. If the latter takes more time than the former, then the straggler would never catch up.

So, instead of taking snapshots frequently, this PR adds a mechanism to track the progress of Raft in the p directory. This way, a restarted Alpha does not need to replay all the Raft logs, only the ones which it hasn't applied yet.

This PR adds two new flags:
1. The duration after which we'd abort a txn.
2. The number of Raft logs after which a snapshot would be taken.

This PR also removes a strange lastCommitTs logic, which was only spitting out an log error, without doing anything.

Changes:
* Keep track of Raft progress in p directory, so it can skip over already applied dataset. Allow user to specify how often to take snapshots.
* Improve how we set raft progress.
* Make x.Parse parse the new Raft key
* Handle key not found error.
* Fix up the debug wal to spit out the last entry as well.
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
Taking snapshots frequently causes a straggler follower to have a hard time getting a new snapshot streamed. If the latter takes more time than the former, then the straggler would never catch up.

So, instead of taking snapshots frequently, this PR adds a mechanism to track the progress of Raft in the p directory. This way, a restarted Alpha does not need to replay all the Raft logs, only the ones which it hasn't applied yet.

This PR adds two new flags:
1. The duration after which we'd abort a txn.
2. The number of Raft logs after which a snapshot would be taken.

This PR also removes a strange lastCommitTs logic, which was only spitting out an log error, without doing anything.

Changes:
* Keep track of Raft progress in p directory, so it can skip over already applied dataset. Allow user to specify how often to take snapshots.
* Improve how we set raft progress.
* Make x.Parse parse the new Raft key
* Handle key not found error.
* Fix up the debug wal to spit out the last entry as well.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants