
Feature/badger db v3 #96

Merged: 17 commits, Feb 18, 2021

Conversation

evgmik
Collaborator

@evgmik evgmik commented Feb 12, 2021

Addresses #92

badger v3.2011.1 made a few old options obsolete, so they have been removed.
I have adjusted the value log file size to 10MB (from the 2GB default) and the memory table size to 10MB (from the 128MB default). If we do our job right, we do not need that much anyway.

Another reason for these changes: badger allocates the above sizes on the hard drive up front. If the user does not use a modern FS which allows sparse files (holes filled with zeros), he or she would see 2GB of space disappear before any operation with brig even starts.
On a modern FS one sees the actual allocation, but a 2GB file size still looks scary to me.

So far, I see no negative effects from the badger version upgrade.

To see the compression improvements, the DB needs to be garbage collected, which can only be done at db.Close().

@evgmik evgmik requested a review from sahib February 12, 2021 04:52
@evgmik
Collaborator Author

evgmik commented Feb 15, 2021

Looks like we are ready to merge with develop.

I decided to skip GC for gateway/db/db.go: it is rarely touched and is not polluted with obsolete records.

By the way, the gateway db has a lot of interfaces which seem to be copy-pasted from catfs/db/database_badger.go. Maybe there is a way to unite them, so they could use the same GC routine.

@sahib
Owner

sahib commented Feb 15, 2021

> I decided to skip GC for gateway/db/db.go: it is rarely touched and is not polluted with obsolete records.

Fine by me.

> By the way, the gateway db has a lot of interfaces which seem to be copy-pasted from catfs/db/database_badger.go. Maybe there is a way to unite them, so they could use the same GC routine.

Right. We might also use badger as a persistent cache for overlay caching. In that process I might add a generic KV util, if I think the abstraction is worth the effort.

var beforeGC uint64 = 0
var total uint64 = 0
var err error
log.Infof("Performin GC for badger DB in %s", opts.Dir)
Owner

typo: "Performing".

Collaborator Author

Fixed, but there is a newer patch which makes it obsolete.

log.Debugf("badger DB in %s has nothing for GC",opts.Dir)
}
continue
}
Owner

What's the purpose of still running RunValueLogGC(0.5) after Close? Is it essential to call?

Collaborator Author

According to the note in dgraph-io/badger#767 (comment),
the compactors, which collect the statistics used by GC, are only run on DB Close(), at least on small DBs.
So Close is needed for our DB size.

But your comment made me do more digging. It turns out I had wrongly reverse-engineered some of badger's GC behavior: RunValueLogGC() actually compacts on the fly; there is no need to Close() to shrink the DB. The problem is that RunValueLogGC does not collect stats by itself, but uses the ones prepared by the compactors.
This leads to the necessity of one Close() to make sure the compactors are forced. Apparently even Close
does not force it, and about a one-minute delay is needed between a DB modification and Close for the compactors to kick in.

So the new patch 3cc667c gets rid of the cycled Close/Open (it now does it only once per runGC call); I also got rid
of the timer since the oldest DB flush, and of the modification flag.

If only there were a reliable way to check whether the DB was modified, then we would not do unnecessary Close() calls. Though I noticed that the filesystem GC is constantly messing with the metadata DB, even on an idle brig.

Owner

Thanks for the explanation. The code looks better to me now.

One simple statistic we could gather and display would be how long the GC run took. I'm a little worried that it blocks all of brig's operation for a considerable time, even if there weren't many writes in that time period.

}
// The badger folks use os.FileInfo to get the total size of the lsm and vlog files.
// This is the allocated size on disk, not the actual data size.
lsm, vlog := bdb.db.Size() // values may be stale even after a Close/Open cycle; badger updates them once per minute
Owner

Do we really need all this complicated logic if it's only for outputting info messages?

@@ -323,6 +400,10 @@ func (db *BadgerDatabase) Flush() error {

db.txn = nil
db.haveWrites = false
if time.Now().After(db.gcNeededAt) && !db.needsGC {
db.gcNeededAt = time.Now().Add(120 * time.Second) // badger updates stats once per minute, we give ample time for stats to settle
Owner

Just a style thing: Please put the comments above the line they're referring to to avoid extra long lines.

Collaborator Author

done

Reading the badger DB issues list yielded the following:
RunValueLogGC() does clean up online. But a small database (150MB) is
not big enough; the only way to update the stats for GC is to close the DB.
See the note in dgraph-io/badger#767 (comment).

Based on this information, the logic is redone to call Close only if
RunValueLogGC did not succeed.
@sahib sahib left a comment

There are two optional subjects that we should discuss before merging. Otherwise looks good to me 👍

// to update stats after DB was modified. But eventually GC stats will be ready.
err := bdb.db.Close()
if err != nil {
log.Fatalf("Could not close DB in %s", opts.Dir)
Owner

I generally see calling log.Fatalf as bad practice, especially outside of quick & dirty programs, and especially in library code (I have a story to tell here). I'm not fully sure what the proper error handling here is, but we could enter a retry loop when we fail to close or open the database.

Collaborator Author

I tried to mitigate this with 91d4826. An error in Close is not fatal; we can try again later. If we cannot open, then we do ten attempts with timeouts and then fail: we cannot run without the DB. I also fixed Close, which did not check for an active transaction. GC can be done on a running system, but Close definitely not.

I would love to hear this story :) I was always thinking that if something goes wrong, the program should fail with an error message so the user would send an angry bug report and we fix the problem. If we were running an industrial process, then the story would be different.

Owner
@sahib sahib Feb 17, 2021

> An error in Close is not fatal

Is it? What if it was partially closed and we're left with a zombie DB conn?

> I would love to hear this story :) If we were running an industrial process, then the story would be different.

The story is: we once used a GPIO library to check some pins on our device. Sadly, I didn't look into the library's implementation (so that fault is on me), but their error handling was "log & exit", which is a bad strategy for a library. I first noticed this when the program didn't run on my work machine (the obvious cause being that the GPIO was not available there), but at some point we made the grave mistake of updating the Go version to 1.14 and all hell broke loose. All services reading this pin suddenly crashed randomly after the update. And why was that? Because since 1.14, select() might return EINTR (rightly so) when its execution is interrupted by the scheduler, which might happen at any point. The GPIO library was not prepared for this and just jumped off the cliff by calling os.Exit(1). The fix was easy, but what we can learn from this:

  • Never ever os.Exit() in library code.
  • Check the quality of the library you're using.
  • Error handling is hard.

> I was always thinking that if something goes wrong, the program should fail with an error message so the user would send an angry bug report and we fix the problem.

That's fine for interactive programs. Long-running programs (and the brig daemon counts as one) should do this only as a last resort, since the error gets buried in some logs (in the best case). But not being able to open the main database is one such case, so 91d4826 is fine with me.

Collaborator Author

I see now why it was a painful experience. Indeed, failing hard is super special and should only be used as a last line of defense.


@evgmik
Collaborator Author

evgmik commented Feb 17, 2021

I have added the GC elapsed time: the worst time is about 170ms, after a big cleanup. The typical time was about 15ms with a Close/Open of the DB, before patch f7ae738.

With patch f7ae738, GC is very quick, since it does not do the unnecessary work (Close/Open) on GC if the stats are known not to have been refreshed.

Run GC only after some DB update, with a delay to give the GC-relevant
stats time to be updated by `badger`.

This prevents doing a useless DB Close/Open in cases where we know for sure
that no changes were made to the DB since the last garbage collection.
Extra change: the retry loop is done with
```
for retries := 0; retries < 10; retries++ {
```
@evgmik
Collaborator Author

evgmik commented Feb 18, 2021

You are right: for retries := 0; retries < 10; retries++ { ... } is much cleaner. Fixed in 5e5821c.
The same patch also retries db.Close. You are right yet again: it could be left in a half-broken state after a partial Close.

@evgmik evgmik merged commit dcbd41a into develop Feb 18, 2021
@evgmik evgmik mentioned this pull request Feb 18, 2021