
Deleting values and running GC doesn't reclaim space #767

Closed
magik6k opened this issue Apr 12, 2019 · 12 comments · Fixed by #778

magik6k commented Apr 12, 2019

I was trying to get Badger GC in go-ipfs to reclaim space, but it didn't seem to work, so I wrote this rather basic test case to see whether it works in the simple case of keys being added, then deleted, and GC run:

https://gist.github.com/magik6k/8c379cc02b443495e4809170fb8803a9

EDIT: gist/results updated, as I discovered that I was calling Delete on the wrong data...

These are my (reproducible) results:
DiscardRatio=0.5

non-deletable put: 101 MB(101 MB); sst: 0 B; vlog: 101 MB
data put: 3.1 GB(3.1 GB); sst: 0 B; vlog: 3.1 GB
put closed: 3.1 GB(3.1 GB); sst: 20 MB; vlog: 3.1 GB
del-open: 3.1 GB(3.1 GB); sst: 20 MB; vlog: 3.1 GB
after del: 3.2 GB(3.2 GB); sst: 20 MB; vlog: 3.1 GB
gc-step: 3.2 GB(3.2 GB); sst: 20 MB; vlog: 3.1 GB
gc: 3.2 GB(3.2 GB); sst: 20 MB; vlog: 3.1 GB
close-gc: 3.2 GB(3.2 GB); sst: 36 MB; vlog: 3.1 GB // expected ~100-120M

The cases below are wrong because of a bug in my code:

DiscardRatio=0.01

non-deletable put: 101 MB(101 MB); sst: 0 B; vlog: 101 MB //put 100M, looks good
data put: 3.1 GB(3.1 GB); sst: 0 B; vlog: 3.1 GB // put 3G, looks good
put closed: 3.1 GB(3.1 GB); sst: 20 MB; vlog: 3.1 GB // close DB, looks good
del-open: 3.1 GB(3.1 GB); sst: 20 MB; vlog: 3.1 GB // reopen db, looks good
after del: 9.1 GB(9.1 GB); sst: 3.0 GB; vlog: 6.1 GB // delete the 3G keys put before, sst explodes, vlog grows by 3G
gc-step: 9.1 GB(9.1 GB); sst: 3.0 GB; vlog: 6.1 GB // call RunValueLogGC, nothing changes
gc: 9.1 GB(9.1 GB); sst: 3.0 GB; vlog: 6.1 GB // ErrNoRewrite, even with 3 mostly rewritable vlogs
close-gc: 9.2 GB(9.2 GB); sst: 3.0 GB; vlog: 6.1 GB // closing doesn't change anything

DiscardRatio=0.5

non-deletable put: 101 MB(101 MB); sst: 0 B; vlog: 101 MB
data put: 3.1 GB(3.1 GB); sst: 0 B; vlog: 3.1 GB
put closed: 3.1 GB(3.1 GB); sst: 20 MB; vlog: 3.1 GB
del-open: 3.1 GB(3.1 GB); sst: 20 MB; vlog: 3.1 GB
after del: 9.1 GB(9.1 GB); sst: 3.0 GB; vlog: 6.1 GB
gc-step: 8.0 GB(8.0 GB); sst: 3.0 GB; vlog: 5.0 GB
gc-step: 7.0 GB(7.0 GB); sst: 3.0 GB; vlog: 4.0 GB
gc-step: 7.0 GB(7.0 GB); sst: 3.0 GB; vlog: 4.0 GB
gc: 7.0 GB(7.0 GB); sst: 3.0 GB; vlog: 4.0 GB
close-gc: 7.0 GB(7.0 GB); sst: 3.0 GB; vlog: 4.0 GB // definitely more than 100M

DiscardRatio=0.9

non-deletable put: 101 MB(101 MB); sst: 0 B; vlog: 101 MB
data put: 3.1 GB(3.1 GB); sst: 0 B; vlog: 3.1 GB
put closed: 3.1 GB(3.1 GB); sst: 20 MB; vlog: 3.1 GB
del-open: 3.1 GB(3.1 GB); sst: 20 MB; vlog: 3.1 GB
after del: 9.1 GB(9.1 GB); sst: 3.0 GB; vlog: 6.1 GB
gc-step: 8.0 GB(8.0 GB); sst: 3.0 GB; vlog: 5.0 GB
gc-step: 8.0 GB(8.0 GB); sst: 3.0 GB; vlog: 5.0 GB
gc: 8.0 GB(8.0 GB); sst: 3.0 GB; vlog: 5.0 GB
close-gc: 8.1 GB(8.1 GB); sst: 3.0 GB; vlog: 5.0 GB

@jarifibrahim
Contributor

@magik6k Thanks for reporting this. I ran your test script and it looks like the GC didn't work (even with 0.01 discard ratio). Let me dig deeper and get back.

@jarifibrahim
Contributor

@magik6k Looks like we have a bug in the WriteBatch API because of which the GC cannot reclaim space. However, if you insert entries using the txn API, the GC should work. I made the following changes to your script, and garbage collection was able to reclaim 2 GB of space.

diff --git a/main_test.go b/main_test.go
index 0d4099b..8759f9b 100644
--- a/main_test.go
+++ b/main_test.go
@@ -13,6 +13,7 @@ import (
 
 	"github.com/dustin/go-humanize"
 	ds "github.com/ipfs/go-datastore"
+	"github.com/stretchr/testify/require"
 
 	"github.com/dgraph-io/badger"
 )
@@ -44,25 +45,29 @@ func TestGc(t *testing.T) {
 
 	r := rand.New(rand.NewSource(555))
 
-	wb := db.NewWriteBatch()
+	txn := db.NewTransaction(true)
 
 	for i := 0; i < preC; i++ { // put non-deletable entries
 		b, err := ioutil.ReadAll(io.LimitReader(r, entryS))
 		if err != nil {
 			t.Fatal(err)
 		}
-		if err := wb.Set(ds.RandomKey().Bytes(), b, 0); err != nil {
+		if err := txn.Set(ds.RandomKey().Bytes(), b); err != nil {
 			t.Fatal(err)
 		}
+		if int64(i)%1000 == 0 {
+			require.NoError(t, txn.Commit())
+			txn = db.NewTransaction(true)
+		}
 	}
 
-	if err := wb.Flush(); err != nil {
+	if err := txn.Commit(); err != nil {
 		t.Fatal(err)
 	}
 
 	pds(t, "non-deletable put")
 
-	wb = db.NewWriteBatch()
+	txn = db.NewTransaction(true)
 	es := make([][]byte, entryC)
 	for i := 0; i < entryC; i++ { // put deletable entries
 		b, err := ioutil.ReadAll(io.LimitReader(r, entryS))
@@ -70,12 +75,19 @@ func TestGc(t *testing.T) {
 			t.Fatal(err)
 		}
 		es[i] = ds.RandomKey().Bytes()
-		if err := wb.Set(es[i], b, 0); err != nil {
+		if err := txn.Set(es[i], b); err != nil {
 			t.Fatal(err)
 		}
+
+		if int64(i)%1000 == 0 {
+			if err := txn.Commit(); err != nil {
+				t.Fatal(err)
+			}
+			txn = db.NewTransaction(true)
+		}
 	}
 
-	if err := wb.Flush(); err != nil {
+	if err := txn.Commit(); err != nil {
 		t.Fatal(err)
 	}
 
@@ -94,13 +106,24 @@ func TestGc(t *testing.T) {
 
 	pds(t, "del-open")
 
-	wb = db.NewWriteBatch()
-	for _, e := range es {
-		if err := wb.Delete(e); err != nil {
+	txn = db.NewTransaction(true)
+	for i, e := range es {
+		if err := txn.Delete(e); err != nil {
 			t.Fatal(err)
 		}
+		if int64(i)%1000 == 0 {
+			if err := txn.Commit(); err != nil {
+				t.Fatal(err)
+			}
+			txn = db.NewTransaction(true)
+		}
+	}
+	if err := txn.Commit(); err != nil {
+		t.Fatal(err)
 	}
-	if err := wb.Flush(); err != nil {
+	db.Close()
+	db, err = badger.Open(opts)
+	if err != nil {
 		t.Fatal(err)
 	}
 

NOTE - It is important that the DB is closed and reopened. We perform compaction when the DB is closed; without compaction, the GC wouldn't be able to free up the space. Compaction happens automatically, but in this case, since there isn't enough data for compaction to be triggered on its own, we force it by closing the DB.

This is what I get on running the script above

$ go test -v github.com/jarifibrahim/foo -run TestGc
=== RUN   TestGc
non-deletable put: 101 MB(101 MB); sst: 0 B; vlog: 101 MB
data put: 3.1 GB(3.1 GB); sst: 0 B; vlog: 3.1 GB
put closed: 3.1 GB(3.1 GB); sst: 20 MB; vlog: 3.1 GB
del-open: 3.1 GB(3.1 GB); sst: 20 MB; vlog: 3.1 GB
after del: 3.2 GB(3.2 GB); sst: 17 MB; vlog: 3.1 GB
gc-step: 2.2 GB(2.2 GB); sst: 17 MB; vlog: 2.2 GB
gc-step: 1.1 GB(1.1 GB); sst: 17 MB; vlog: 1.1 GB
gc-step: 1.1 GB(1.1 GB); sst: 17 MB; vlog: 1.1 GB
gc: 1.1 GB(1.1 GB); sst: 17 MB; vlog: 1.1 GB
close-gc: 1.1 GB(1.1 GB); sst: 17 MB; vlog: 1.1 GB
--- PASS: TestGc (38.71s)
PASS


$ du -lh /tmp/badger/*
1.1G	/tmp/badger/000002.vlog
7.6M	/tmp/badger/000003.vlog
17M	/tmp/badger/000006.sst
4.0K	/tmp/badger/MANIFEST

The 000002.vlog file wasn't removed because it might not have enough discardable entries (we set the discard ratio to 0.5). The other file, 000003.vlog, was not removed because it is the latest in-use vlog file (it holds the head marker); we never remove the vlog file with the latest head marker.
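The pattern described in this comment (close and reopen the DB to force compaction, then invoke value log GC until it reports nothing left to rewrite) can be sketched as a small loop. This is a sketch, not badger's API: `errNoRewrite` and the `gc` callback are stand-ins for `badger.ErrNoRewrite` and `db.RunValueLogGC(discardRatio)`, so the snippet runs without the badger dependency.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoRewrite is a stand-in for badger.ErrNoRewrite, used here so the
// sketch runs without the badger dependency.
var errNoRewrite = errors.New("value log GC attempt didn't result in any cleanup")

// runGCUntilDone calls gc repeatedly. Badger's RunValueLogGC rewrites at
// most one value-log file per successful call, which is why the test
// script in this thread sees space shrink one "gc-step" at a time:
// callers are expected to loop until ErrNoRewrite.
func runGCUntilDone(gc func() error) (rewrites int, err error) {
	for {
		if err := gc(); err != nil {
			if errors.Is(err, errNoRewrite) {
				return rewrites, nil // nothing left worth rewriting
			}
			return rewrites, err // a real failure
		}
		rewrites++
	}
}

func main() {
	// Fake GC: pretend two vlog files are rewritable, then nothing is left.
	remaining := 2
	gc := func() error {
		if remaining == 0 {
			return errNoRewrite
		}
		remaining--
		return nil
	}
	n, err := runGCUntilDone(gc)
	fmt.Println(n, err) // 2 <nil>
}
```

In a real program the loop body would call `db.RunValueLogGC(0.5)` on an open `*badger.DB` after the close/reopen cycle described above.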

@magik6k
Author

magik6k commented Apr 23, 2019

Is there a way to trigger compaction without closing the database?

@jarifibrahim
Contributor

Try running Flatten

badger/db.go

Lines 1185 to 1191 in 1fcc96e

// Flatten can be used to force compactions on the LSM tree so all the tables fall on the same
// level. This ensures that all the versions of keys are colocated and not split across multiple
// levels, which is necessary after a restore from backup. During Flatten, live compactions are
// stopped. Ideally, no writes are going on during Flatten. Otherwise, it would create competition
// between flattening the tree and new tables being created at level zero.
func (db *DB) Flatten(workers int) error {

jarifibrahim pushed a commit to jarifibrahim/badger that referenced this issue Apr 23, 2019
Every Transaction stores the latest value of readTs it is aware of. When the transaction is discarded (which happens even when we commit), the global value of readTs is updated.
Previously, the readTs of transaction inside the write batch struct was set to 0. So the global value of readTs would also be set to 0 (unless someone ran a transaction after using write batch).
Due to the 0 value of the global readTs, the compaction algorithm would skip all the values. With this commit, the compaction algorithm works fine with key-values inserted via Transaction API or
via the Write Batch API.

See https://github.com/dgraph-io/badger/blob/1fcc96ecdb66d221df85cddec186b6ac7b6dab4b/levels.go#L480-L484 and dgraph-io#767
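The watermark behaviour described in this commit message can be modelled in a few lines. `dropObsolete` below is a toy illustration, not badger's actual compaction code: versions of a key at or below the read watermark (`readMark.DoneUntil()`) collapse to just the newest one, so a watermark stuck at 0 forces every version to be kept, which is why the sst size exploded after the batched deletes.

```go
package main

import "fmt"

// dropObsolete models the compaction rule described above: versions at or
// below the read watermark (doneUntil) can be collapsed to the newest one;
// versions above it must be kept because an open transaction might still
// read them. The input slice is sorted newest-first.
func dropObsolete(versions []uint64, doneUntil uint64) []uint64 {
	kept := []uint64{}
	seenBelow := false
	for _, v := range versions {
		if v > doneUntil {
			kept = append(kept, v) // possibly still visible to a reader
			continue
		}
		if !seenBelow {
			kept = append(kept, v) // newest version at the watermark
			seenBelow = true
		}
		// older versions at/below the watermark are discarded
	}
	return kept
}

func main() {
	versions := []uint64{30, 20, 10} // newest first
	// Healthy watermark: older versions 20 and 10 are dropped.
	fmt.Println(dropObsolete(versions, 30)) // [30]
	// Buggy WriteBatch left the watermark at 0: nothing can be dropped.
	fmt.Println(dropObsolete(versions, 0)) // [30 20 10]
}
```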
jarifibrahim pushed the same commit (with minor message revisions) to jarifibrahim/badger several more times between Apr 23 and Apr 30, 2019, and it was pushed to this repository on May 2, 2019.
@jarifibrahim
Copy link
Contributor

@magik6k With #778, the script in https://gist.github.com/magik6k/8c379cc02b443495e4809170fb8803a9 would still produce similar results, but the values would be removed eventually; the GC would reclaim the space after some time.

@linxGnu

linxGnu commented May 29, 2019

Hello @jarifibrahim

I have the same problem: disk space leaks when deleting keys/values using a write batch.

The commit d98dd68#diff-42ea5667b327bb207485077410d5f499 reverts your previous fix, and the problem remains.

How about reopening this issue?

@jarifibrahim
Contributor

jarifibrahim commented May 29, 2019

@linxGnu The value log GC isn't supposed to reclaim space immediately. The change in #778 was reverted because we had issues with it.

The issue here isn't with the GC; it's with the Write Batch API. You need not worry about the GC: it will eventually clean up the space. There are multiple factors involved when it tries to find a vlog file to clean.

Take a look at the following script, which works perfectly fine: https://gist.github.com/jarifibrahim/78621293e68dffbc30be860f3c9df549#file-main_test-go, and its output: https://gist.github.com/jarifibrahim/78621293e68dffbc30be860f3c9df549#file-output-txt.

I am not sure if this is an actual bug. I mean the GC did reclaim space. It just didn't do it immediately.

@linxGnu

linxGnu commented May 30, 2019

@jarifibrahim Thank you very much for the details. I'll take another look and report back if I still see disk space not being reclaimed when using the Write Batch API 👍

@jarifibrahim
Contributor

@linxGnu Just to help you understand how GC works --

  1. We store discard stats, which track how much data can be discarded in each vlog file. The discard stats are built when compactions happen. (You can force compaction by closing and reopening the DB.)
  2. If the discard stats have no information about a specific file, we sample the file and, based on the sample, decide whether it should be cleaned up.

Value Log GC is supposed to clean up space eventually. There might be cases when GC doesn't clean up the data, but it will be cleaned up eventually.
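As a rough model of step 2 above, the sampling decision comes down to comparing the discardable fraction of the sampled bytes against the discard ratio passed to `RunValueLogGC`. `shouldRewrite` below is a simplified illustration, not badger's exact heuristic, but it matches the behaviour seen in the thread: a higher discard ratio makes the GC more reluctant to rewrite a file.

```go
package main

import "fmt"

// shouldRewrite is a simplified model of the GC decision: after sampling
// a vlog file, rewrite it only if the discardable fraction of the sampled
// bytes meets the discard ratio passed to RunValueLogGC.
func shouldRewrite(discardedBytes, sampledBytes int64, discardRatio float64) bool {
	if sampledBytes == 0 {
		return false // nothing sampled, nothing to decide on
	}
	return float64(discardedBytes)/float64(sampledBytes) >= discardRatio
}

func main() {
	fmt.Println(shouldRewrite(60, 100, 0.5)) // true: 60% discardable meets ratio 0.5
	fmt.Println(shouldRewrite(10, 100, 0.5)) // false: only 10% discardable
	fmt.Println(shouldRewrite(60, 100, 0.9)) // false: ratio 0.9 demands 90%
}
```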

evgmik added a commit to sahib/brig that referenced this issue Feb 16, 2021
Reading the badger DB issues list yielded the following: RunValueLogGC() does clean up online, but a small database (~150 MB) is not big enough to trigger compaction, so the only way to update the stats GC needs is to close the DB. See the note in dgraph-io/badger#767 (comment).

Based on this information, the logic is redone to call Close only if RunValueLogGC did not succeed.
manishrjain pushed the same fix commit to outcaste-io/outserv on Jul 6, 2022.
@ashish314

Hello @jarifibrahim,

I have a couple of questions regarding BadgerDB's garbage collection and file deletion process:

Does BadgerDB delete files automatically, or do users need to call RunValueLogGC at intervals to delete discarded files after compaction?

Is there any way we can obtain information about which files will be deleted during the process of RunValueLogGC before they are actually deleted? This would be helpful if we want to store this data to a cheaper storage solution for backup or archiving purposes.

I would appreciate any insights or guidance you can provide on these topics. Thank you!

@mangalaman93
Contributor

Hi @ashish314,

If you want to take a backup of data in badger, you should use the Backup APIs in badger. You can see more details here https://dgraph.io/docs/badger/get-started/#database-backup/

I am not sure whether you need to call RunValueLogGC periodically, but it seems to me that you should.

@jarifibrahim
Contributor

Hi @ashish314!

Does BadgerDB delete files automatically, or do users need to call RunValueLogGC at intervals to delete discarded files after compaction?

Badger will perform cleanup automatically. This means it will delete old data and files automatically.

Is there any way we can obtain information about which files will be deleted during the process of RunValueLogGC before they are actually deleted? This would be helpful if we want to store this data to a cheaper storage solution for backup or archiving purposes.

We don't expose this information but you shouldn't need to worry about this data. GC removes only deleted/expired/duplicate/stale data. All the useful data is kept as it is.

You can take periodic backups of your data if you'd like using the backup API.
