
Feature/badger db v3 #96

Merged: 17 commits, Feb 18, 2021

Conversation

evgmik
Collaborator

@evgmik evgmik commented Feb 12, 2021

Addresses #92

badger v3.2011.1 made a few old options obsolete, so they have been removed.
I have adjusted the value log file size to 10MB (from the 2GB default) and the memory table size to 10MB (from the 128MB default). If we do our job right, we do not need that much anyway.

Another reason for these changes: badger allocates the above sizes on the hard drive up front. If the user does not use a modern FS which allows sparse files (holes filled with zeros), he or she would see 2GB of space disappear before any operation with brig even starts.
On a modern FS one sees the actual allocation, but a 2GB file size still looks scary to me.

So far, I see no negative effects from the badger version upgrade.

To see the compression improvements, the DB needs to be garbage collected, which can only be done at db.Close().

@evgmik evgmik requested a review from sahib February 12, 2021 04:52
@evgmik
Collaborator Author

evgmik commented Feb 15, 2021

Looks like we are ready to merge with develop.

I decided to skip GC for gateway/db/db.go: it is rarely touched and is not polluted with obsolete records.

By the way, the gateway db has a lot of interfaces which seem to be copy-pasted from catfs/db/database_badger.go. Maybe there is a way to unite them, so they could use the same GC routine.

@sahib
Owner

sahib commented Feb 15, 2021

> I decided to skip GC for gateway/db/db.go: it is rarely touched and is not polluted with obsolete records.

Fine by me.

> By the way, the gateway db has a lot of interfaces which seem to be copy-pasted from catfs/db/database_badger.go. Maybe there is a way to unite them, so they could use the same GC routine.

Right. We might also use badger as a persistent cache for overlay caching. In that process I might add a generic KV util, if I think the abstraction is worth the effort.

var beforeGC uint64 = 0
var total uint64 = 0
var err error
log.Infof("Performin GC for badger DB in %s", opts.Dir)
Owner

typo: "Performing".

Collaborator Author

Fixed, but there is a newer patch which makes it obsolete.

log.Debugf("badger DB in %s has nothing for GC",opts.Dir)
}
continue
}
Owner

What's the purpose of still running RunValueLogGC(0.5) after Close? Is it essential to call?

Collaborator Author

According to the note in dgraph-io/badger#767 (comment),
the compactors, which collect the statistics used by GC, are only run on DB Close(), at least on small DBs.
So Close is needed for our DB size.

But your comment made me do more digging. It turns out I had wrongly reverse-engineered some of badger's GC behavior: RunValueLogGC() actually compacts on the fly; there is no need to Close() to shrink the DB. The problem is that RunValueLogGC does not collect stats by itself, but uses the ones prepared by the compactors.
This leads to the necessity of one Close() to make sure the compactors are forced. Apparently even Close
does not force it, and about a one-minute delay is needed between a DB modification and Close for the compactors to kick in.

So the new patch 3cc667c gets rid of the cycled Close/Open (it now does it only once per runGC call); I also got rid
of the timer since the oldest DB flush, and of the modification flag.

If only there were a reliable way to check whether the DB was modified, then we would not do unnecessary Close() calls. Though I noticed that the filesystem GC is constantly messing with the metadata DB, even on an idle brig.

Owner

Thanks for the explanation. The code looks better to me now.

One simple statistic we could gather and display would be how long the GC run took. I'm a little worried that it blocks all of brig's operation for a considerable time, even if there weren't many writes in that time period.

}
// The badger folks use os.FileInfo to get the total size of the lsm and vlog files.
// This is the allocated size on disk, not the actual data size.
lsm, vlog := bdb.db.Size() // values may be stale even after a Close/Open cycle; badger updates them once per minute
Owner

Do we really need all this complicated logic if it's only for outputting info messages?

@@ -323,6 +400,10 @@ func (db *BadgerDatabase) Flush() error {

db.txn = nil
db.haveWrites = false
if time.Now().After(db.gcNeededAt) && !db.needsGC {
db.gcNeededAt = time.Now().Add(120 * time.Second) // badger updates stats once per minute, we give ample time for stats to settle
Owner

Just a style thing: Please put the comments above the line they're referring to to avoid extra long lines.

Collaborator Author

done

Reading the badger DB issues list yielded the following:
RunValueLogGC() does clean up online. But a small database (150MB) is
not big enough; the only way to update the stats for GC is to close the DB.
See the note in dgraph-io/badger#767 (comment).

Based on this information, the logic is redone to call Close only if
RunValueLogGC did not succeed.
@sahib sahib left a comment

There are two optional subjects that we should discuss before merging. Otherwise looks good to me 👍

// to update stats after DB was modified. But eventually GC stats will be ready.
err := bdb.db.Close()
if err != nil {
log.Fatalf("Could not close DB in %s", opts.Dir)
Owner

I generally see calling log.Fatalf as bad practice, especially outside of quick & dirty programs, and especially in library code (I have a story to tell here). I'm not fully sure what the proper error handling here is, but we could enter a retry loop when we fail to close or open the database.

Collaborator Author

I tried to mitigate this with 91d4826. An error in Close is not fatal; we can try again later. If we cannot open, then we do ten attempts with timeouts and then fail: we cannot run without the DB. I also fixed Close, which did not check for an active transaction. GC can be done on a running system, but Close definitely not.

I would love to hear this story :) I was always thinking that if something goes wrong, the program should fail with an error message so the user would send an angry bug report and we fix the problem. If we were running an industrial process, then the story would be different.

Owner
@sahib sahib Feb 17, 2021

> An error in Close is not fatal

Is it? What if it was partially closed and we're left with a zombie DB conn?

> I would love to hear this story :) If we were running an industrial process, then the story would be different.

The story is: we once used a GPIO library to check some pins on our device. Sadly, I didn't look into the library's implementation (so that fault is on me), but their error handling was "log & exit", which is a bad strategy for a library. I first noticed this when the program didn't run on my work machine (the obvious cause being that the GPIO was not available there), but at some point we made the grave mistake of updating the Go version to 1.14 and all hell broke loose. All services reading this pin suddenly crashed randomly after the update. And why was that? Because since 1.14, select() might return EINTR (rightly so) when its execution is interrupted by the scheduler, which might happen at any point. The GPIO library was not prepared for this and just jumped off the cliff by calling os.Exit(1). The fix was easy, but what we can learn from this:

  • Never ever os.Exit() in library code.
  • Check the quality of the library you're using.
  • Error handling is hard.

> I was always thinking that if something goes wrong, the program should fail with an error message so the user would send an angry bug report and we fix the problem.

That's fine for interactive programs. Long-running programs (and the brig daemon counts as one) should do this only as a last resort, since the error gets buried in some logs (in the best case). But not being able to open the main database is one such case, so 91d4826 is fine with me.

Collaborator Author

I see now why it was a painful experience. Indeed, failing hard is super special and should only be used as a last line of defense.


@evgmik
Collaborator Author

evgmik commented Feb 17, 2021

I have added the GC elapsed time: the worst time is about 170ms, after a big cleanup. The typical time was about 15ms with a Close/Open of the DB, before patch f7ae738.

With patch f7ae738, GC is very quick, since it does not do the unnecessary work (Close/Open) on GC if the stats are known not to have been refreshed.

Run GC only after some DB update, with a delay to give the GC-relevant
stats time to be updated by `badger`.

This prevents doing a useless DB Close/Open in cases where we know for sure
that no changes were made to the DB since the last garbage collection.
Extra change: the retry loop is done with
```
for retries := 0; retries < 10; retries++ {
```
@evgmik
Collaborator Author

evgmik commented Feb 18, 2021

You are right: for retries := 0; retries < 10; retries++ { ... } is much cleaner. Fixed in 5e5821c.
The same patch also retries db.Close. You are right yet again: it could be left in a half-broken state after a partial Close.

@evgmik evgmik merged commit dcbd41a into develop Feb 18, 2021
@evgmik evgmik mentioned this pull request Feb 18, 2021