lightning/backend/local: use batch write instead of ingest on overlap #857
Conversation
```diff
@@ -608,7 +608,6 @@ func (local *local) openEngineDB(engineUUID uuid.UUID, readOnly bool) (*pebble.D
 		L0CompactionThreshold: local.maxOpenFiles / 2,
 		L0StopWritesThreshold: local.maxOpenFiles / 2,
 		MaxOpenFiles:          local.maxOpenFiles,
-		DisableWAL:            true,
```
Why remove this
"We enabled WAL to ensure a flushed Batch Write survives even if Lightning crashes."
it is in the PR description 😓
I think the WAL is to ensure a committed batch can be recovered even if it's not flushed. Without WAL, a flushed batch can still survive even if Lightning crashes? 🤔
BTW I'd like to know the performance regression with WAL on vs off. If it's non-trivial, I doubt it's worth paying this price only to address the rare crash case.
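For reference, here is a minimal, self-contained sketch (not Lightning's actual code) of the durability behaviour being discussed, using Pebble's public API: with the WAL enabled, a batch committed with `pebble.Sync` survives a crash even before it is flushed, while data already flushed to SSTs survives with or without a WAL.

```go
package main

import "github.com/cockroachdb/pebble"

func main() {
	// Open a throwaway DB with the WAL enabled (DisableWAL: false), which is
	// what this PR does for the engine DBs.
	db, err := pebble.Open("demo-db", &pebble.Options{DisableWAL: false})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Commit a batch. With the WAL enabled, pebble.Sync makes the commit
	// durable before Apply returns, even though nothing has been flushed yet.
	b := db.NewBatch()
	if err := b.Set([]byte("k"), []byte("v"), nil); err != nil {
		panic(err)
	}
	if err := db.Apply(b, pebble.Sync); err != nil {
		panic(err)
	}

	// An explicit flush writes the memtable out as an SST, making the data
	// durable regardless of whether a WAL is used.
	if err := db.Flush(); err != nil {
		panic(err)
	}
}
```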
```diff
@@ -1884,7 +1883,24 @@ func (w *LocalWriter) writeKVsOrIngest(desc localIngestDescription) error {
 		}
 	}
 
-	// if write failed only because of unorderedness, we immediately ingest the memcache.
+	// if write failed because of unorderedness, we immediately ingest or write the memcache.
```
Since we have disabled the implicit compaction, the main difference between the ingest and write-batch approaches is the memcache size: for ingest it's 128MB, and for write-batch it's 512MB. So the ingest approach will generate 3 times more SST files than write-batch. This is likely a temporary solution. Maybe a simpler way is to use ingest for the data engine and write batch for the index engine.
Also, the write-batch approach has proved to be very slow, especially when there is only one table with a lot of indexes, so I don't think this change can address the performance problem. Shall we wait for #753 with the manual compaction?
> Maybe a simpler way is to use ingest for data engine and write batch for index engine.

so you mean just remove the overlap check?

> So I don't think this change can address the performance problem.

this PR addresses the problem that (during Disk Quota) FlushAllEngines gets stuck because of having 17000 L0 files. I doubt #753 has tested that scenario.
But the write-batch approach still has this problem; because its batch size is larger, in this scenario the L0 SST file count should be about 17000 / 4, which is still very big. Without compaction (either manual or implicit) we can't fully address this problem. And the write-batch is slow due to the skip-list insert.
We can accept this PR, but this is a temp solution. #753 uses manual compaction to control the SST file number. With a 10TB source file size, it can ensure the total SST count is no more than 3k.
BTW, I'm afraid this approach will introduce a performance regression, because for each write it needs to first check the db iterator, and then fall back to the write-batch solution if the overlap is somewhat severe.
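For context, the overlap check referred to here can be done with a bounded Pebble iterator over the new data's key range. A rough sketch, not the PR's actual code, assuming an older Pebble API where `NewIter` returns only the iterator:

```go
package local // illustrative package name, not the actual Lightning package

import "github.com/cockroachdb/pebble"

// rangeOverlaps reports whether any existing key in db falls inside
// [first, last].
func rangeOverlaps(db *pebble.DB, first, last []byte) bool {
	// UpperBound is exclusive, so extend `last` by one byte to include it.
	upper := append(append([]byte{}, last...), 0)
	iter := db.NewIter(&pebble.IterOptions{
		LowerBound: first,
		UpperBound: upper,
	})
	defer iter.Close()
	// Any key yielded by the bounded iterator means the new data overlaps
	// existing data, in which case ingesting the SST would stack it in L0.
	return iter.First()
}
```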
If we write all data into the index engine by write batch, we may be blocked by the flush thread, because all data must be flushed by a single thread even when a lot of data is being written into the index engine.
Maybe we could split the source data into more than 4 index engines and just merge their results with a merge iterator?
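As a rough illustration of that suggestion, a k-way merge over the sorted output of several index engines could look like the sketch below; `kvStream` and the `kv` struct are hypothetical stand-ins rather than Lightning types, and the merge uses a plain heap instead of any particular Pebble iterator API.

```go
package mergesketch

import (
	"bytes"
	"container/heap"
)

// kv is a hypothetical key/value pair, not a Lightning type.
type kv struct{ key, val []byte }

// kvStream is a hypothetical sorted iterator over one index engine's output.
type kvStream interface {
	Next() (kv, bool) // next pair in key order, or ok=false when exhausted
}

type mergeItem struct {
	cur kv
	src kvStream
}

// mergeHeap orders the head element of each stream by key.
type mergeHeap []mergeItem

func (h mergeHeap) Len() int            { return len(h) }
func (h mergeHeap) Less(i, j int) bool  { return bytes.Compare(h[i].cur.key, h[j].cur.key) < 0 }
func (h mergeHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x interface{}) { *h = append(*h, x.(mergeItem)) }
func (h *mergeHeap) Pop() interface{} {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

// mergeStreams emits the union of all streams in globally sorted key order,
// which is what a merge iterator over several index engines would produce.
func mergeStreams(streams []kvStream, emit func(kv)) {
	h := &mergeHeap{}
	for _, s := range streams {
		if p, ok := s.Next(); ok {
			heap.Push(h, mergeItem{cur: p, src: s})
		}
	}
	for h.Len() > 0 {
		top := heap.Pop(h).(mergeItem)
		emit(top.cur)
		if p, ok := top.src.Next(); ok {
			heap.Push(h, mergeItem{cur: p, src: top.src})
		}
	}
}
```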
@kennytm: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Close outdated PR.
What problem does this PR solve?
Local backend's local writer always uses Ingest SST to copy the content into Pebble. If the SST overlaps with existing data in Pebble, it will be placed at a higher level, eventually placing everything at L0. This greatly increases read amplification and makes Flush() (a prerequisite of disk quota) extremely slow.
What is changed and how it works?
If the SST to be ingested overlaps with some existing data, we switch to using Batch Write instead so that Pebble can better rearrange the data.
Replaces #843.
We enabled WAL to ensure a flushed Batch Write survives even if Lightning crashes.
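For illustration only, a simplified sketch of this flow (not the actual LocalWriter code) might look as follows; `kvPair` and `rangeOverlaps` (the bounded-iterator probe sketched in the review thread above) are illustrative placeholders.

```go
package local // illustrative package, not the actual lightning backend package

import "github.com/cockroachdb/pebble"

// kvPair stands in for an entry of the writer's in-memory KV cache.
type kvPair struct{ key, val []byte }

// writeSortedRun ingests the prebuilt SST when its key range does not overlap
// existing data; otherwise it falls back to a write batch so Pebble can
// rearrange the keys instead of stacking SSTs in L0.
func writeSortedRun(db *pebble.DB, kvs []kvPair, sstPath string) error {
	if len(kvs) == 0 {
		return nil
	}
	first, last := kvs[0].key, kvs[len(kvs)-1].key
	if !rangeOverlaps(db, first, last) {
		// Fast path: no overlap, so the ingested SST can be placed at a low level.
		return db.Ingest([]string{sstPath})
	}
	// Overlap detected: commit the same KV pairs through a write batch.
	// With the WAL enabled, a synced batch survives a Lightning crash.
	b := db.NewBatch()
	for _, p := range kvs {
		if err := b.Set(p.key, p.val, nil); err != nil {
			return err
		}
	}
	return db.Apply(b, pebble.Sync)
}
```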
Check List
Tests
Code changes
Side effects
Related changes
Release Note